# Get Started with Notebooks in Azure Machine Learning

Azure Machine Learning is a cloud-based service for creating and managing machine learning solutions. It's designed to help data scientists and machine learning engineers leverage their existing data processing and model development skills and frameworks, and scale their workloads to the cloud.

A lot of data science and machine learning work is accomplished in notebooks like this one. Notebooks consist of *cells*. Some (like the one containing this text) hold notes, graphics, and other content, usually written in *markdown*; others (like the cell below this one) contain code that you can run interactively within the notebook.

## The Azure Machine Learning Python SDK

You can run pretty much any Python code in a notebook, provided the required Python packages are installed in the environment where you're running it. In this case, you're running the notebook in a *Conda* environment on an Azure Machine Learning compute instance. This environment is installed in the compute instance by default, and contains common Python packages that data scientists typically work with. It also includes the Azure Machine Learning Python SDK, which is a Python package that enables you to write code that uses resources in your Azure Machine Learning workspace.

Run the cell below to import the **azureml-core** package and check the version of the SDK that is installed.

In [1]:
import azureml.core
print("Ready to use Azure ML", azureml.core.VERSION)

Ready to use Azure ML 1.47.0


## Connect to your workspace

All experiments and associated resources are managed within your Azure Machine Learning workspace. You can connect to an existing workspace, or create a new one using the Azure Machine Learning SDK.

In most cases, you should store workspace connection information in a JSON configuration file. This makes it easier to connect without needing to remember details like your Azure subscription ID. You can download the JSON configuration file from the blade for your workspace in the Azure portal or from the workspace details pane in Azure Machine Learning studio, but if you're using a compute instance within your workspace, the configuration file has already been downloaded to the root folder.

The code below uses the configuration file to connect to your workspace.

> **Note**: The first time you connect to your workspace in a notebook session, you may be prompted to sign in to Azure by clicking the `https://microsoft.com/devicelogin` link, entering an automatically generated code, and signing in. After you have successfully signed in, you can close the browser tab that was opened and return to this notebook.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, "loaded")
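
The configuration file that `Workspace.from_config()` reads is a small JSON document. As a minimal sketch, assuming the usual three-key layout (`subscription_id`, `resource_group`, `workspace_name`), the snippet below writes a placeholder file to a temporary folder and reads it back with the standard library. The helper `read_workspace_config` and the placeholder values are illustrative only and are not part of the SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys assumed to be present in a workspace configuration file.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}

# Placeholder values; a real file holds the details of your own workspace.
example_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}


def read_workspace_config(path):
    """Load a config file and check that the expected keys are present (hypothetical helper)."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"config file is missing keys: {sorted(missing)}")
    return config


# Write the placeholder config to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example_config, indent=2), encoding="utf-8")
    loaded = read_workspace_config(config_path)

print(sorted(loaded))
```

Inspecting the file this way can help when `Workspace.from_config()` fails, for example because the file was copied from a different workspace.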

## View Azure Machine Learning resources in the workspace

Now that you have a connection to your workspace, you can work with the resources. For example, you can use the following code to enumerate the compute resources in your workspace.

In [None]:
print("Compute Resources:")
for compute_name in ws.compute_targets:
    compute = ws.compute_targets[compute_name]
    print("\t", compute.name, ':', compute.type)

## Loading data 
Loading data into a pandas dataframe works just like in any other notebook. We load the pandas package, and make sure we have the right path. In this case, we already have a data set stored in the data folder.

In [12]:
import pandas as pd 
df = pd.read_csv("data/diabetes.csv", index_col = 0)
df.head()

PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
1354778,0,171,80,34,23,43.509726,1.213191,21,0
1147438,8,92,93,47,36,21.240576,0.158365,23,0
1640031,7,115,47,52,35,41.511523,0.079019,23,0
1883350,9,103,78,25,304,29.582192,1.28287,43,1
1424119,1,85,59,27,35,42.604536,0.549542,22,0


In [13]:
df.columns

Index(['Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure',
       'TricepsThickness', 'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age',
       'Diabetic'],
      dtype='object')

## Training our first model
Training a model is also straightforward! Do note: the example below is not about how to build good models. Rather, we are going to expand on this example to highlight some features of Azure Machine Learning Studio.

In [17]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# load data set
print("Loading Data...")
df = pd.read_csv('data/diabetes.csv')

# Separate features and labels (PatientID is left out of the features)
X, y = df[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, df['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Set regularization hyperparameter
reg = 0.1

# Train a logistic regression model (in scikit-learn, C is the inverse of the regularization strength)
print('Training a logistic regression model with regularization rate of', reg)
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))


Loading Data...
Training a logistic regression model with regularization rate of 0.1
AUC: 0.8484357430717946
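
To make the AUC figure above more concrete: for a binary classifier, the area under the ROC curve equals the fraction of (positive, negative) example pairs in which the positive example receives the higher score. The sketch below computes that fraction directly in plain Python on a tiny toy input. The function `pairwise_auc` and the toy values are illustrative and are not taken from the diabetes data; in the notebook itself `roc_auc_score` does this work.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities for the positive class.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 positive/negative pairs are ranked correctly.
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why only the second column of `predict_proba` (the score for the positive class) is passed to `roc_auc_score`.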


## Keeping track of things
You are perhaps already familiar with versioning your code using Git. Azure Machine Learning Studio offers integration with Git, but we will not cover it here. (If you are curious, can you spot, at the top right, how to launch a terminal attached to your compute? From that terminal you can use Git.) Besides versioning your code, you can also version your data and your model output. Tracking code, dataset, and model versions is part of the MLOps workflow for making results reproducible.

Machine Learning Studio lets you version your datasets and model output straight out of the box. Let's adjust our code example above to do this tracking.
We will:

- Add our dataset from local file storage to a datastore attached to our Machine Learning Studio instance
- Show how to read our dataset from this central datastore
- Show how we can register our dataset in the workspace
- Show how we can register our model object

First: keeping track of our data

In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

from azureml.core import Dataset,Workspace
from azureml.data.datapath import DataPath

# Get the default datastore
ws = Workspace.from_config()
default_ds = ws.get_default_datastore()

# This will upload all content from the folder data into the folder diabetes-data in our default data store
Dataset.File.upload_directory(src_dir='data',
                              target=DataPath(default_ds, 'diabetes-data/')
                              )

# Create a tabular dataset from the path on the datastore (this may take a short while)
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/diabetes.csv'))


# Register a data set
try:
    tab_data_set = tab_data_set.register(workspace=ws, 
                                        name='diabetes dataset',
                                        description='diabetes data',
                                        tags = {'format':'CSV'},
                                        create_new_version=True)
except Exception as ex:
    print(ex)


# List all available data sets
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

# Read a specific, registered, data set with optional version argument
dataset = Dataset.get_by_name(ws, "diabetes dataset", version = 'latest')
df = dataset.to_pandas_dataframe()
df.head()


Validating arguments.
Arguments validated.
Uploading file to diabetes-data/
Uploading an estimated of 3 files
Target already exists. Skipping upload for diabetes-data/.amlignore
Target already exists. Skipping upload for diabetes-data/.amlignore.amltmp
Target already exists. Skipping upload for diabetes-data/diabetes.csv
Uploaded 0 files
Creating new dataset
Datasets:
	 diabetes dataset version 1


,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0


Next, we combine this with building and registering a model. Note that we also introduce a new concept here: we create an experiment run to keep track of our experiment (i.e. the training job).

In [13]:
import pandas as pd
import numpy as np
import os 
import joblib
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

from azureml.core import Dataset, Experiment, Model, Workspace

# load data set
print("Loading Data...")
ws = Workspace.from_config()
dataset = Dataset.get_by_name(ws, "diabetes dataset", version = 'latest')
df = dataset.to_pandas_dataframe()

# We need to start our experiment
experiment = Experiment(workspace=ws, name="diabetes-experiment-frank")

# Start logging data from the experiment, obtaining a reference to the experiment run
run = experiment.start_logging()
print("Starting experiment:", experiment.name)

# Separate features and labels (PatientID is left out of the features)
X, y = df[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, df['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Set regularization hyperparameter
reg = 0.1

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# Calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))

# Let's save the AUC score of this run:
run.log('AUC', float(auc))

# End the experiment
run.complete()

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

# Register the model
Model.register(workspace=ws, model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Jupyter Notebook'},
                   model_framework=Model.Framework.SCIKITLEARN,
                   model_framework_version=sklearn.__version__,
                   properties={"AUC":float(auc)},
                   datasets = [('Full data set',dataset)])

Model(workspace=Workspace.create(name='fbutersml', subscription_id='004164f2-1c28-44aa-882e-3dfd3bd634e0', resource_group='fbuters-rg'), name=diabetes_model, id=diabetes_model:3, version=3, tags={'Training context': 'Jupyter Notebook'}, properties={'AUC': '0.8484392258321308'})

In [14]:
# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model version: 3
	 Training context : Jupyter Notebook
	 AUC : 0.8484392258321308
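
The output above shows that the AUC property comes back as a string. As a hedged sketch of how you might compare registered versions, the snippet below picks the version with the highest AUC after converting the property to a float. It uses plain dictionaries as stand-ins for the objects returned by `Model.list(ws)`; the helper `best_by_auc` and the version 2 record are hypothetical and exist only for illustration.

```python
# Stand-in records mirroring the name, version, and properties printed above.
registered = [
    # Hypothetical earlier version, invented for this illustration.
    {"name": "diabetes_model", "version": 2, "properties": {"AUC": "0.84"}},
    # Version and AUC as shown in the notebook output.
    {"name": "diabetes_model", "version": 3, "properties": {"AUC": "0.8484392258321308"}},
]


def best_by_auc(records):
    """Return the record with the highest AUC property (hypothetical helper)."""
    scored = [r for r in records if "AUC" in r["properties"]]
    if not scored:
        raise ValueError("no record carries an AUC property")
    # Properties are strings, so convert to float before comparing.
    return max(scored, key=lambda r: float(r["properties"]["AUC"]))


best = best_by_auc(registered)
print(best["name"], "version", best["version"])
```

Converting to `float` matters because comparing the raw strings would sort them lexicographically rather than numerically.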


