In [1]:
!pip install azureml-core --user



In [2]:
# https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py
# how to install Azure ML SDK on local notebook

### Work with Datastores

Data is the foundation on which machine learning models are built. Managing data centrally in the cloud, and making it accessible to teams of data scientists who run experiments and train models on multiple workstations and compute targets, is an important part of any professional data science solution.

In this notebook, you'll explore two Azure Machine Learning objects for working with data: datastores, and datasets.

### Install the Azure Machine Learning SDK
The Azure Machine Learning SDK is updated frequently. The first cell of this notebook installs the azureml-core package, and a later cell installs the package that supports notebook widgets. Run the following cell to check which SDK version is installed.

Ref: https://github.com/MicrosoftLearning/DP100/blob/master/04A%20-%20Working%20with%20Datastores.ipynb

In [3]:
import azureml.core
print(azureml.core.VERSION)

1.41.0


In [4]:
from azureml.core import Workspace, Datastore, Dataset
from azureml.data.datapath import DataPath

ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.41.0 to work with demo-aml-wrkspace


### Work with datastores

In Azure ML, datastores are references to storage locations, such as Azure Storage blob containers. Every workspace has a default datastore - usually the Azure storage blob container that was created with the workspace. If you need to work with data that is stored in different locations, you can add custom datastores to your workspace and set any of them to be the default.

#### View datastores
Run the following code to determine the datastores in your workspace:

In [5]:
# Get the default datastore
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, "- Default =", ds_name == default_ds.name)

azureml_globaldatasets - Default = False
workspaceworkingdirectory - Default = False
workspaceartifactstore - Default = False
workspacefilestore - Default = False
workspaceblobstore - Default = True


You can also view and manage datastores in your workspace on the Datastores page for your workspace in Azure Machine Learning studio.
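
The enumeration above can be wrapped in a small helper. The sketch below is illustrative only: `summarize_datastores` is a hypothetical name, and it relies solely on the `ws.datastores` mapping and the `ws.get_default_datastore()` call already used in this notebook. A stand-in workspace object is used so that the cell runs without an Azure connection.

```python
from types import SimpleNamespace

def summarize_datastores(ws):
    """Return (name, is_default) pairs for every datastore in a workspace."""
    default_name = ws.get_default_datastore().name
    return [(name, name == default_name) for name in ws.datastores]

# Stand-in for a real Workspace (assumption: only the two members used above are needed)
fake_ws = SimpleNamespace(
    datastores={"workspaceblobstore": None, "workspacefilestore": None},
    get_default_datastore=lambda: SimpleNamespace(name="workspaceblobstore"),
)

for name, is_default in summarize_datastores(fake_ws):
    print(name, "- Default =", is_default)
```

With a real workspace, passing `ws` instead of `fake_ws` should give the same kind of listing as the cell above.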

### Upload data to a datastore

Now that you have determined the available datastores, you can upload files from your local file system to a datastore so that it will be accessible to experiments running in the workspace, regardless of where the experiment script is actually being run.

In [6]:
default_ds.upload_files(files=['./mydata/hepatitis_data.csv'], # Upload the hepatitis CSV file from ./mydata
                       target_path='hepc-datastore/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

"datastore.upload_files" is deprecated after version 1.0.69. Please use "FileDatasetFactory.upload_directory" instead. See Dataset API change notice at https://aka.ms/dataset-deprecation.


Uploading an estimated of 1 files
Uploading ./mydata/hepatitis_data.csv
Uploaded ./mydata/hepatitis_data.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_acf31cf7a46f47209ece138fd0ed9fd0
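
The deprecation warning above points to `FileDatasetFactory.upload_directory`, which is exposed as `Dataset.File.upload_directory`. A minimal sketch of that replacement follows. The helper name `upload_folder` is hypothetical, and the call uploads a whole local folder rather than a single file, so treat it as a starting point and not as a drop-in equivalent.

```python
def upload_folder(datastore, src_dir, target_path):
    """Upload a local folder to a datastore and return a FileDataset for it."""
    # Imported lazily so this helper can be defined where the SDK is not installed
    from azureml.core import Dataset

    return Dataset.File.upload_directory(
        src_dir=src_dir,                  # local folder, for example './mydata'
        target=(datastore, target_path),  # (datastore, path on the datastore)
        overwrite=True,                   # replace existing files of the same name
        show_progress=True,
    )
```

In this notebook the call would look like `upload_folder(default_ds, './mydata', 'hepc-datastore/')`.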

### Create a tabular dataset
Let's create a dataset from the hepatitis C data you uploaded to the datastore, and view the first 20 records. In this case, the data is in a structured format in a CSV file, so we'll use a tabular dataset.

In [7]:
!pip install azureml-dataset-runtime --user



In [8]:
from azureml.core import Dataset

# Get the default datastore
default_ds = ws.get_default_datastore()

#Create a tabular dataset from the path on the datastore (this may take a short while)
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'hepc-datastore/hepatitis_data.csv'))

# Display the first 20 rows as a Pandas dataframe
tab_data_set.take(20).to_pandas_dataframe()

Unnamed: 0,Column1,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106,12.1,69.0
1,2,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74,15.6,76.5
2,3,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86,33.2,79.3
3,4,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80,33.8,75.7
4,5,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76,29.9,68.7
5,6,0=Blood Donor,32,m,41.6,43.3,18.5,19.7,12.3,9.92,6.05,111,91.0,74.0
6,7,0=Blood Donor,32,m,46.3,41.3,17.5,17.8,8.5,7.01,4.79,70,16.9,74.5
7,8,0=Blood Donor,32,m,42.2,41.9,35.8,31.1,16.1,5.82,4.6,109,21.5,67.1
8,9,0=Blood Donor,32,m,50.9,65.5,23.2,21.2,6.9,8.69,4.1,83,13.7,71.3
9,10,0=Blood Donor,32,m,42.4,86.3,20.3,20.0,35.2,5.46,4.45,81,15.9,69.9


As you can see in the code above, it's easy to convert a tabular dataset to a Pandas dataframe, enabling you to work with the data using common Python techniques.

### Create a file Dataset

The dataset you created is a tabular dataset that can be read as a dataframe containing all of the data in the structured files that are included in the dataset definition. This works well for tabular data, but in some machine learning scenarios you might need to work with unstructured data, or you may simply want to handle reading the data from files in your own code. To accomplish this, you can use a file dataset, which creates a list of file paths in a virtual mount point that you can use to read the data in the files.

In [9]:
#Create a file dataset from the path on the datastore (this may take a short while)
file_data_set = Dataset.File.from_files(path=(default_ds, 'hepc-datastore/hepatitis_data.csv'))

# Get the files in the dataset
for file_path in file_data_set.to_path():
    print(file_path)

/hepatitis_data.csv


In [10]:
file_data_set

{
  "source": [
    "('workspaceblobstore', 'hepc-datastore/hepatitis_data.csv')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ]
}

### Register datasets
Now that you have created datasets that reference the hepatitis C data, you can register them to make them easily accessible to any experiment being run in the workspace.

We'll register the tabular dataset as Hepatitis_C dataset, and the file dataset as Hepatitis_C file dataset.

In [11]:
# Register the tabular dataset
try:
    tab_data_set = tab_data_set.register(workspace=ws, 
                                        name='Hepatitis_C dataset',
                                        description='HepatitisC data',
                                        tags = {'format':'CSV'},
                                        create_new_version=True)
except Exception as ex:
    print(ex)

# Register the file dataset
try:
    file_data_set = file_data_set.register(workspace=ws,
                                            name='Hepatitis_C file dataset',
                                            description='HepatitisC files',
                                            tags = {'format':'CSV'},
                                            create_new_version=True)
except Exception as ex:
    print(ex)

print('Datasets registered')

Datasets registered


You can view and manage datasets on the Datasets page for your workspace in Azure Machine Learning studio. You can also get a list of datasets from the workspace object:

In [12]:
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

Datasets:
	 HepatitisC processed data version 1
	 Hepatitis_C file dataset version 1
	 Hepatitis_C dataset version 1
	 HepatitisC Preprocess dataset version 1
	 HepatitisC file dataset version 2
	 HepatitisC dataset version 5
	 amldataset1 version 1


The ability to version datasets enables you to redefine datasets without breaking existing experiments or pipelines that rely on previous definitions. By default, the latest version of a named dataset is returned, but you can retrieve a specific version of a dataset by specifying the version number, like this:

dataset_v1 = Dataset.get_by_name(ws, 'Hepatitis_C dataset', version = 1)
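
To make the versioning behaviour concrete, here is a toy in-memory model of it. This is not how Azure ML stores datasets; `ToyDatasetRegistry` and its methods are hypothetical and only mimic the rule described above: each registration under the same name adds a version, the latest is returned by default, and an older one can be requested by number.

```python
class ToyDatasetRegistry:
    """Toy model of versioned registration (not the Azure ML implementation)."""

    def __init__(self):
        self._versions = {}  # name -> list of definitions; index 0 is version 1

    def register(self, name, definition):
        """Store a new definition and return its version number."""
        self._versions.setdefault(name, []).append(definition)
        return len(self._versions[name])

    def get_by_name(self, name, version="latest"):
        """Return the latest definition, or a specific 1-based version."""
        versions = self._versions[name]
        if version == "latest":
            return versions[-1]
        return versions[version - 1]


registry = ToyDatasetRegistry()
registry.register("Hepatitis_C dataset", "hepc-datastore/hepatitis_data.csv")
registry.register("Hepatitis_C dataset", "hepc-datastore/hepC_processed.csv")

print(registry.get_by_name("Hepatitis_C dataset"))
print(registry.get_by_name("Hepatitis_C dataset", version=1))
```

An experiment pinned to version 1 keeps seeing the first definition even after the name is re-registered.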

### Preprocess the tabular dataset

In [13]:
# Load the full tabular dataset into a Pandas dataframe and check its shape
tab_data_df_prep = tab_data_set.to_pandas_dataframe()
tab_data_df_prep.shape

(615, 14)

In [14]:
tab_data_df_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 615 entries, 0 to 614
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Column1   615 non-null    int64  
 1   Category  615 non-null    object 
 2   Age       615 non-null    int64  
 3   Sex       615 non-null    object 
 4   ALB       614 non-null    float64
 5   ALP       597 non-null    float64
 6   ALT       614 non-null    float64
 7   AST       615 non-null    float64
 8   BIL       615 non-null    float64
 9   CHE       615 non-null    float64
 10  CHOL      615 non-null    object 
 11  CREA      565 non-null    float64
 12  GGT       615 non-null    float64
 13  PROT      614 non-null    float64
dtypes: float64(9), int64(2), object(3)
memory usage: 67.4+ KB


In [15]:
# Convert the "Sex" column: male ('m') becomes '1', female ('f') becomes '2' (stored as strings)
tab_data_df_prep["Sex"] = tab_data_df_prep["Sex"].map({"m":'1', "f":'2'})

In [16]:
tab_data_df_prep.head()

Unnamed: 0,Column1,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,1,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,0=Blood Donor,32,1,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,3,0=Blood Donor,32,1,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,4,0=Blood Donor,32,1,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,0=Blood Donor,32,1,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
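
One property of `Series.map` with a dictionary is worth checking here: any value that is not a key in the mapping becomes `NaN`. The sketch below uses a toy series (an assumption, not the notebook's data) to show how unexpected codes such as an upper-case `"M"` would silently turn into missing values.

```python
import pandas as pd

# Toy stand-in for the Sex column, including two values the mapping does not cover
sex = pd.Series(["m", "f", "M", None])
mapped = sex.map({"m": "1", "f": "2"})  # same mapping as the cell above

# Values missing from the mapping ("M" and None) become NaN
print(mapped.isna().sum())
```

Running `tab_data_df_prep["Sex"].isna().sum()` after the mapping is a quick way to confirm that every row was covered.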


In [17]:
# Convert the Category column, which is the target variable: "1=Hepatitis" becomes 1
# and every other category becomes 0, giving a binary classification target.
# First, inspect the class counts.
tab_data_df_prep['Category'].value_counts()

0=Blood Donor             533
3=Cirrhosis                30
1=Hepatitis                24
2=Fibrosis                 21
0s=suspect Blood Donor      7
Name: Category, dtype: int64

In [18]:
import numpy as np
tab_data_df_prep["Category"] = np.where(tab_data_df_prep['Category'] =="1=Hepatitis" , 1, 0)
print(tab_data_df_prep["Category"].value_counts())

0    591
1     24
Name: Category, dtype: int64
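
The counts above show a heavily imbalanced target. The short calculation below, using only the two numbers printed above, works out the positive rate and the accuracy that a model would reach by always predicting 0. It is a useful reference point when reading the accuracy metric logged by the training run.

```python
positives, negatives = 24, 591         # counts printed above
total = positives + negatives          # 615 rows

positive_rate = positives / total      # share of "1=Hepatitis" rows
majority_baseline = negatives / total  # accuracy of always predicting 0

print(f"positive rate:     {positive_rate:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

Because the baseline is already high, a metric such as AUC tends to be more informative than accuracy for this target.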


In [19]:
# Count missing values in each column
tab_data_df_prep.isnull().sum()

Column1      0
Category     0
Age          0
Sex          0
ALB          1
ALP         18
ALT          1
AST          0
BIL          0
CHE          0
CHOL         0
CREA        50
GGT          0
PROT         1
dtype: int64

In [20]:
tab_data_df_prep.info()

In [21]:
# Impute missing values with the column mean
tab_data_df_prep['ALP'] = tab_data_df_prep['ALP'].fillna(tab_data_df_prep['ALP'].mean())
tab_data_df_prep['ALB'] = tab_data_df_prep['ALB'].fillna(tab_data_df_prep['ALB'].mean())
tab_data_df_prep['ALT'] = tab_data_df_prep['ALT'].fillna(tab_data_df_prep['ALT'].mean())
tab_data_df_prep['PROT'] = tab_data_df_prep['PROT'].fillna(tab_data_df_prep['PROT'].mean())
tab_data_df_prep['CREA'] = tab_data_df_prep['CREA'].fillna(tab_data_df_prep['CREA'].mean())
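
The five assignments above follow one pattern, so they could also be written as a loop. The sketch below is one possible form, shown on a toy dataframe; `impute_with_mean` is a hypothetical helper name, and it returns a copy instead of modifying the input.

```python
import numpy as np
import pandas as pd

def impute_with_mean(df, columns):
    """Fill missing values in the given numeric columns with each column's mean."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].mean())
    return out

# Toy data with one gap per column
toy = pd.DataFrame({"ALB": [40.0, np.nan, 44.0], "ALP": [50.0, 70.0, np.nan]})
filled = impute_with_mean(toy, ["ALB", "ALP"])
print(filled)
```

For this notebook the column list would be `['ALP', 'ALB', 'ALT', 'PROT', 'CREA']`.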


In [22]:
# Confirm that no missing values remain
tab_data_df_prep.isnull().sum()

Column1     0
Category    0
Age         0
Sex         0
ALB         0
ALP         0
ALT         0
AST         0
BIL         0
CHE         0
CHOL        0
CREA        0
GGT         0
PROT        0
dtype: int64

In [23]:
tab_data_df_prep[tab_data_df_prep['CHOL'] == 'NA']

Unnamed: 0,Column1,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
121,122,0,43,1,48.6,45.0,10.5,40.5,5.3,7.09,,63.0,25.1,70.0
319,320,0,32,2,47.4,52.5,19.1,17.1,4.6,10.19,,63.0,23.0,72.2
329,330,0,33,2,42.4,137.2,14.2,13.1,3.4,8.23,,48.0,25.7,74.4
413,414,0,46,2,42.9,55.1,15.2,29.8,3.6,8.37,,61.0,29.0,71.9
424,425,0,48,2,45.6,107.2,24.4,39.0,13.8,9.77,,88.0,38.0,75.1
433,434,0,48,2,46.8,93.3,10.0,23.2,4.3,12.41,,52.0,23.9,72.4
498,499,0,57,2,48.4,94.4,2.5,39.6,2.3,8.84,,82.0,6.4,76.8
584,585,0,75,2,36.0,68.28392,114.0,125.0,14.0,6.65,,57.0,177.0,72.0
590,591,0,46,1,20.0,68.28392,62.0,113.0,254.0,1.48,,114.0,138.0,72.044137
603,604,0,65,1,41.620195,68.28392,40.0,54.0,13.0,7.5,,70.0,107.0,79.0


In [24]:
tab_data_df_prep["CHOL"] = np.where(tab_data_df_prep['CHOL'] =="NA" , 0.1, tab_data_df_prep['CHOL'])
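
The `np.where` call above replaces the `"NA"` text but leaves CHOL as an object column. An alternative, sketched below on a toy series and assuming the column holds text values, is to coerce the column to numbers first and then fill the gaps. The 0.1 placeholder is kept only to match the cell above.

```python
import pandas as pd

# Toy stand-in for the CHOL column, which is read as text because of the "NA" entries
chol = pd.Series(["3.23", "NA", "4.8"], dtype=object)

# Coerce text to numbers; anything unparseable (here "NA") becomes NaN
chol_numeric = pd.to_numeric(chol, errors="coerce")

# Apply the same 0.1 placeholder used above, now on a float64 column
chol_filled = chol_numeric.fillna(0.1)
print(chol_filled.dtype)
```

A numeric dtype makes later steps such as computing a mean or training a model less dependent on implicit conversions.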


### Upload the preprocessed dataset to the Azure datastore

 - Save the dataframe to a CSV file on the local drive.
 - Then upload the CSV file to the Azure datastore.
 - Then create the preprocessed dataset from the datastore.

In [25]:
# Save the preprocessed dataset as a CSV file locally
tab_data_df_prep.to_csv("mydata/hepC_processed.csv")

In [26]:
default_ds.upload_files(files=['./mydata/hepC_processed.csv'], # Upload the preprocessed CSV file from ./mydata
                       target_path='hepc-datastore/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

Uploading an estimated of 1 files
Uploading ./mydata/hepC_processed.csv
Uploaded ./mydata/hepC_processed.csv, 1 files out of an estimated total of 1
Uploaded 1 files


$AZUREML_DATAREFERENCE_10fc1ce2144d4d3a82ef5f6dd1927213

In [27]:

#Create a tabular preprocessed dataset from the path on the datastore (this may take a short while)
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'hepc-datastore/hepC_processed.csv'))

# Display the first 20 rows as a Pandas dataframe
tab_data_set.take(20).to_pandas_dataframe()

Unnamed: 0,Column1,Column1_1,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,0,1,0,32,1,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,1,2,0,32,1,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,2,3,0,32,1,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,3,4,0,32,1,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,4,5,0,32,1,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
5,5,6,0,32,1,41.6,43.3,18.5,19.7,12.3,9.92,6.05,111.0,91.0,74.0
6,6,7,0,32,1,46.3,41.3,17.5,17.8,8.5,7.01,4.79,70.0,16.9,74.5
7,7,8,0,32,1,42.2,41.9,35.8,31.1,16.1,5.82,4.6,109.0,21.5,67.1
8,8,9,0,32,1,50.9,65.5,23.2,21.2,6.9,8.69,4.1,83.0,13.7,71.3
9,9,10,0,32,1,42.4,86.3,20.3,20.0,35.2,5.46,4.45,81.0,15.9,69.9
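
The extra leading column in the output above is consistent with the dataframe index having been written to the CSV file, since `to_csv` includes the index by default. The sketch below shows the difference on a toy dataframe. The column renaming seen above (`Column1`, `Column1_1`) is presumably applied when the file is read back, and is not reproduced here.

```python
import io
import pandas as pd

df = pd.DataFrame({"Column1": [1, 2], "Age": [32, 32]})

with_index = io.StringIO()
df.to_csv(with_index)                  # default: index written as an unnamed first column

without_index = io.StringIO()
df.to_csv(without_index, index=False)  # index omitted

print(with_index.getvalue().splitlines()[0])
print(without_index.getvalue().splitlines()[0])
```

The training script selects its feature columns by name, so the extra column does not affect the model, but passing `index=False` keeps the stored file cleaner.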


In [28]:
# Register the tabular dataset
try:
    tab_data_set = tab_data_set.register(workspace=ws, 
                                        name='HepatitisC processed data',
                                        description='HepatitisC processed data',
                                        tags = {'format':'CSV'},
                                        create_new_version=True)
except Exception as ex:
    print(ex)

### Train a model from a tabular dataset
Now that you have datasets, you're ready to start training models from them. You can pass datasets to scripts as inputs in the script run configuration used to run the script.

Run the following two code cells to create:

 - A folder named hepc_training_with_tabdataset
 - A script that trains a classification model by using a tabular dataset that is passed to it as an argument.

In [29]:
import os

# Create a folder for the experiment files
experiment_folder = 'hepc_training_with_tabdataset'
os.makedirs(experiment_folder, exist_ok=True)
print(experiment_folder, 'folder created')

hepc_training_with_tabdataset folder created


In [30]:
%%writefile $experiment_folder/hepc_training.py
# Import libraries
import os
import argparse
from azureml.core import Run, Dataset
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get the script arguments (regularization rate and training dataset ID)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
parser.add_argument("--input-data", type=str, dest='training_dataset_id', help='training dataset')
args = parser.parse_args()

# Set regularization hyperparameter (passed as an argument to the script)
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# Get the training dataset
print("Loading Data...")
hepc = run.input_datasets['training_data'].to_pandas_dataframe()

# Separate features and labels
X, y = hepc[['Age','Sex','ALB','ALP','ALT','AST','BIL','CHE', 'CHOL', 'CREA', 'GGT', 'PROT']].values, hepc['Category'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/hepc_model.pkl')

run.complete()

Overwriting hepc_training_with_tabdataset/hepc_training.py


In [31]:
!pip install ruamel.yaml



In [32]:
!pip install azureml.widgets



 - Note: In the script, the dataset is passed as a parameter (or argument). In the case of a tabular dataset, this argument will contain the ID of the registered dataset; so you could write code in the script to get the experiment's workspace from the run context, and then get the dataset using its ID; like this:

run = Run.get_context()

ws = run.experiment.workspace

dataset = Dataset.get_by_id(ws, id=args.training_dataset_id)

hepc = dataset.to_pandas_dataframe()


However, Azure Machine Learning runs automatically identify arguments that reference named datasets and add them to the run's input_datasets collection, so you can also retrieve the dataset from this collection by specifying its "friendly name" (which as you'll see shortly, is specified in the argument definition in the script run configuration for the experiment). This is the approach taken in the script above.
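
To see how the two script arguments are read, the parsing step can be run on its own. The sketch below reuses the argument definitions from the training script. The `parse_args` wrapper and the `'abc-123'` value are hypothetical stand-ins, the latter for a real dataset ID.

```python
import argparse

def parse_args(argv):
    """Parse the same two arguments that the training script expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01,
                        help='regularization rate')
    parser.add_argument('--input-data', type=str, dest='training_dataset_id',
                        help='training dataset')
    return parser.parse_args(argv)

# Hypothetical command line; 'abc-123' stands in for a real dataset ID
args = parse_args(['--regularization', '0.1', '--input-data', 'abc-123'])
print(args.reg_rate, args.training_dataset_id)
```

If `--regularization` is omitted, `args.reg_rate` falls back to the default of 0.01.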


 - Now you can run a script as an experiment, defining an argument for the training dataset, which is read by the script.

 - Note: The Dataset class depends on some components in the azureml-dataprep package, which includes optional support for pandas that is used by the to_pandas_dataframe() method. So you need to include this package in the environment where the training experiment will be run.

In [33]:
from azureml.core import Experiment, ScriptRunConfig, Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.widgets import RunDetails


# Create a Python environment for the experiment
sklearn_env = Environment("sklearn-env")

# Ensure the required packages are installed (we need scikit-learn, Azure ML defaults, and Azure ML dataprep)
packages = CondaDependencies.create(conda_packages=['scikit-learn','pip'],
                                    pip_packages=['azureml-defaults','azureml-dataprep[pandas]'])
sklearn_env.python.conda_dependencies = packages

# Get the training dataset
hepc_ds = ws.datasets.get("HepatitisC processed data")

# Create a script config
script_config = ScriptRunConfig(source_directory=experiment_folder,
                              script='hepc_training.py',
                              arguments = ['--regularization', 0.1, # Regularization rate parameter
                                           '--input-data', hepc_ds.as_named_input('training_data')], # Reference to dataset
                              environment=sklearn_env) 

# submit the experiment
experiment_name = 'mslearn-train-hepc'
experiment = Experiment(workspace=ws, name=experiment_name)
run = experiment.submit(config=script_config)
RunDetails(run).show()
run.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'mslearn-train-hepc_1652454613_779bc1ea',
 'target': 'local',
 'status': 'Finalizing',
 'startTimeUtc': '2022-05-13T15:10:20.008029Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': 'ecea6f98-a3dd-4530-b10f-493d84535b0b'},
 'inputDatasets': [{'dataset': {'id': 'f9bd2c30-90db-4a8f-9111-335c56a5a838'}, 'consumptionDetails': {'type': 'RunInput', 'inputName': 'training_data', 'mechanism': 'Direct'}}],
 'outputDatasets': [],
 'runDefinition': {'script': 'hepc_training.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--regularization',
   '0.1',
   '--input-data',
   'DatasetConsumptionConfig:training_data'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'local',
  'dataReferences': {},
  'data': {'training_data': {'dataLocation': {'dataset': {'id': 'f9bd2c30-90db-4a8f-9111-335c56a5a838',
      'name': 'HepatitisC processed data',
      'version': '1'},
     'dataP

 - Note: The --input-data argument passes the dataset as a named input that includes a friendly name for the dataset, which is used by the script to read it from the input_datasets collection in the experiment run. The string value in the --input-data argument is actually the registered dataset's ID. As an alternative approach, you could simply pass hepc_ds.id, in which case the script can access the dataset ID from the script arguments and use it to get the dataset from the workspace, but not from the input_datasets collection.

The first time the experiment is run, it may take some time to set up the Python environment - subsequent runs will be quicker.

When the experiment has completed, in the widget, view the azureml-logs/70_driver_log.txt output log and the metrics generated by the run.
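
The two metrics logged by the run can be illustrated in plain Python. This sketch is not how scikit-learn computes them internally; it spells out the definitions on four made-up predictions. Accuracy is the share of correct labels, and AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # threshold at 0.5

print(accuracy(y_true, y_pred), auc(y_true, scores))
```

Unlike accuracy, AUC does not depend on the 0.5 threshold, which is one reason it is often preferred for imbalanced targets.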

### Register the trained model
As with any training experiment, you can retrieve the trained model and register it in your Azure Machine Learning workspace.

In [34]:
from azureml.core import Model

run.register_model(model_path='outputs/hepc_model.pkl', model_name='hepatitisC_model',
                   tags={'Training context':'Tabular dataset'}, properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

hepatitisC_model version: 2
	 Training context : Tabular dataset
	 AUC : 0.9700000000000001
	 Accuracy : 0.9675675675675676


hepatitisC_model version: 1
	 Training context : Tabular dataset
	 AUC : 0.9700000000000001
	 Accuracy : 0.9675675675675676




### Train a model from a file dataset
You've seen how to train a model using training data in a tabular dataset; but what about a file dataset?

When you're using a file dataset, the dataset argument passed to the script represents a mount point containing file paths. How you read the data from these files depends on the kind of data in the files and what you want to do with it. In the case of the hepatitis C CSV file, you can use the Python glob module to create a list of files in the virtual mount point defined by the dataset, and read them all into Pandas dataframes that are concatenated into a single dataframe.
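
The glob-and-concatenate pattern can be tried locally before it is used in a training script. In the sketch below a temporary folder stands in for the dataset mount point, and `read_csv_folder` is a hypothetical helper name.

```python
import glob
import os
import tempfile

import pandas as pd

def read_csv_folder(folder):
    """Read every CSV file in a folder into one DataFrame."""
    all_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    return pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True, sort=False)

# Temporary folder standing in for the dataset mount point (assumption for this demo)
with tempfile.TemporaryDirectory() as folder:
    pd.DataFrame({"Age": [32], "Category": [0]}).to_csv(os.path.join(folder, "a.csv"), index=False)
    pd.DataFrame({"Age": [46], "Category": [1]}).to_csv(os.path.join(folder, "b.csv"), index=False)
    combined = read_csv_folder(folder)

print(combined)
```

Sorting the file list keeps the row order reproducible, since `glob` does not guarantee any particular order.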

Run the following two code cells to create:

 - A folder named diabetes_training_from_file_dataset
 - A script that trains a classification model by using a file dataset that is passed to it as an input.

In [35]:
import os

# Create a folder for the experiment files
experiment_folder = 'diabetes_training_from_file_dataset'
os.makedirs(experiment_folder, exist_ok=True)
print(experiment_folder, 'folder created')

diabetes_training_from_file_dataset folder created


In [36]:
%%writefile $experiment_folder/hepc_training.py
# Import libraries
import os
import glob
import argparse
from azureml.core import Run, Dataset
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get script arguments (regularization rate and file dataset mount point)
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
parser.add_argument('--input-data', type=str, dest='dataset_folder', help='data mount point')
args = parser.parse_args()

# Set regularization hyperparameter (passed as an argument to the script)
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# Load the hepatitis C dataset
print("Loading Data...")
data_path = run.input_datasets['training_files'] # Get the training data path from the input
# (You could also just use args.dataset_folder if you don't want to rely on a hard-coded friendly name)

# Read the files
all_files = glob.glob(data_path + "/*.csv")
hepc = pd.concat((pd.read_csv(f) for f in all_files), sort=False)

# Separate features and labels
X, y = hepc[['Age','Sex','ALB','ALP','ALT','AST','BIL','CHE', 'CHOL', 'CREA', 'GGT', 'PROT']].values, hepc['Category'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/hepc_model.pkl')

run.complete()

Overwriting diabetes_training_from_file_dataset/hepc_training.py


Just as with tabular datasets, you can retrieve a file dataset from the input_datasets collection by using its friendly name. You can also retrieve it from the script argument, which in the case of a file dataset contains a mount path to the files (rather than the dataset ID passed for a tabular dataset).

Next we need to change the way we pass the dataset to the script - it needs to define a path from which the script can read the files. You can use either the as_download or as_mount method to do this. Using as_download causes the files in the file dataset to be downloaded to a temporary location on the compute where the script is being run, while as_mount creates a mount point from which the files can be streamed directly from the datastore.

You can combine the access method with the as_named_input method to include the dataset in the input_datasets collection in the experiment run (if you omit this, for example by setting the argument to file_data_set.as_mount(), the script will be able to access the dataset mount point from the script arguments, but not from the input_datasets collection).