# Exploring Python SDK v2 in Azure Machine Learning

In this example we will explore the Azure Machine Learning Python SDK and some of its functionality for machine learning.
We will:
- Install the required Python packages
- Set up MLflow tracking
- Get data from Azure storage
- Create simple Python functions
- Train and evaluate the prediction model


In [2]:
# Installing the required Python packages
#!pip install azure-ai-ml
%pip install -r RequirementsPySDK.txt

Collecting pandas>=1.2.0
  Downloading pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.2 MB)
Collecting Jinja2>=3.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting MarkupSafe>=2.1.1
  Downloading MarkupSafe-2.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
ERROR: pyldavis 3.3.1 requires sklearn, which is not installed.
ERROR: pandas-ml 0.6.1 requires enum34, which is not installed.
ERROR: fbprophet 0.7.1 requires cmdstanpy==0.9.5, which is not installed.
ERROR: tensorboard 2.2.2 has requirement google-auth<2,>=1.6.3, but you'll have google-auth 2.13.0 which is incompatible.
ERROR: responsibleai 0.22.0 has requirement ipykernel<=6.6.0, but you'll have ipykernel 6.8.0 which is incompatible.
ERROR: responsibleai 0.22.0 has requirement m

In [3]:
# import mlflow and Workspace, then set the tracking URI and the experiment
import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("lightgbm-iris-toy-demo")
     

2022/12/07 05:12:34 INFO mlflow.tracking.fluent: Experiment with name 'lightgbm-iris-toy-demo' does not exist. Creating a new experiment.


<Experiment: artifact_location='', creation_time=1670389955391, experiment_id='cb09a125-b165-4a68-9525-fb606f6a5837', last_update_time=None, lifecycle_stage='active', name='lightgbm-iris-toy-demo', tags={}>

In [4]:
# get the data from a public Azure Blob Storage location
# this is a general-purpose location for the example data
data_uri = "https://azuremlexamples.blob.core.windows.net/datasets/iris.csv"

In [5]:
# read the CSV into a pandas DataFrame and check the first few rows
import pandas as pd

df = pd.read_csv(data_uri)
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


Now, let's create some Python functions for machine learning. These will be:
- preprocess_data
- train_model
- evaluate_model

I am also importing the sklearn metrics functions `log_loss` and `accuracy_score`, as well as the sklearn `train_test_split` function :-)
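
To make the two metrics concrete, here is a minimal pure-Python sketch of what log loss and accuracy measure. It only illustrates the idea and is not sklearn's implementation; the helper names `accuracy` and `multiclass_log_loss` and the sample probabilities are made up for illustration.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def multiclass_log_loss(y_true, y_proba, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for t, row in zip(y_true, y_proba):
        p = min(max(row[t], eps), 1.0)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical true class ids and predicted probabilities for three samples
y_true = [0, 2, 1]
y_proba = [
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],  # the true class (1) is not the most probable one
]

# Predicted class = index of the largest probability in each row
y_pred = [row.index(max(row)) for row in y_proba]

print("accuracy:", accuracy(y_true, y_pred))
print("log loss:", multiclass_log_loss(y_true, y_proba))
```

Accuracy only counts whether the top class is right, while log loss also penalizes a model for being confident and wrong, which is why both are worth tracking.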


In [6]:
# imports
import time

import lightgbm as lgb

from sklearn.metrics import log_loss, accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

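# define functions
def preprocess_data(df):
    """Split features from the label, encode the label, and make an 80/20 train/test split."""
    X = df.drop(["species"], axis=1)
    y = df["species"]

    # encode the species names as integer class ids
    enc = LabelEncoder()
    y = enc.fit_transform(y)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    return X_train, X_test, y_train, y_test, enc


def train_model(params, num_boost_round, X_train, X_test, y_train, y_test):
    """Train a LightGBM model and return it with the elapsed training time in seconds."""
    t1 = time.time()
    train_data = lgb.Dataset(X_train, label=y_train)
    test_data = lgb.Dataset(X_test, label=y_test)
    model = lgb.train(
        params,
        train_data,
        num_boost_round=num_boost_round,
        valid_sets=[test_data],  # report the metric on the test split each round
        valid_names=["test"],
    )
    t2 = time.time()

    return model, t2 - t1


def evaluate_model(model, X_test, y_test):
    """Return the log loss and accuracy of the model on the test split."""
    y_proba = model.predict(X_test)  # per-class probabilities, one row per sample
    y_pred = y_proba.argmax(axis=1)  # most probable class per sample
    loss = log_loss(y_test, y_proba)
    acc = accuracy_score(y_test, y_pred)

    return loss, acc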

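
As a rough illustration of the label-encoding step inside `preprocess_data`, the sketch below mimics what `LabelEncoder` does: it sorts the distinct labels and maps each one to its position. The helper `fit_label_encoding` is hypothetical, and the species names other than `Iris-setosa` are assumed from the standard Iris dataset rather than taken from the output shown earlier.

```python
def fit_label_encoding(labels):
    """Map each distinct label to an integer id, in sorted order."""
    classes = sorted(set(labels))
    to_int = {label: i for i, label in enumerate(classes)}
    return classes, to_int


# Assumed sample of species labels (only Iris-setosa appears in df.head())
species = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]

classes, to_int = fit_label_encoding(species)
encoded = [to_int[s] for s in species]
decoded = [classes[i] for i in encoded]  # reverse lookup by position

print(classes)
print(encoded)
print(decoded == species)
```

With the real encoder, the `enc` object returned by `preprocess_data` can map predicted class ids back to species names through `enc.inverse_transform`.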


We will define the parameters for LightGBM and start an MLflow run to log the training and evaluate the model.

In [7]:

# preprocess data
X_train, X_test, y_train, y_test, enc = preprocess_data(df)

# set training parameters
params = {
    "objective": "multiclass",
    "num_class": 3,
    "learning_rate": 0.1,
    "metric": "multi_logloss",
    "colsample_bytree": 1.0,
    "subsample": 1.0,
    "seed": 42,
}

num_boost_round = 32

# start run
run = mlflow.start_run()

# enable automatic logging
mlflow.lightgbm.autolog()

# train model
model, train_time = train_model(
    params, num_boost_round, X_train, X_test, y_train, y_test
)
mlflow.log_metric("training_time", train_time)

# evaluate model
loss, acc = evaluate_model(model, X_test, y_test)
mlflow.log_metrics({"loss": loss, "accuracy": acc})
     

You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 91
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score -1.098612
[LightGBM] [Info] Start training from score -1.073920
[LightGBM] [Info] Start training from score -1.123930
[1]	test's multi_logloss: 0.930558
[2]	test's multi_logloss: 0.795536
[3]	test's multi_logloss: 0.68756
[4]	test's multi_logloss: 0.593833
[5]	test's multi_logloss: 0.51883
[6]	test's multi_logloss: 0.454422
[7]	test's multi_logloss: 0.401051
[8]	test's multi_logloss: 0.353053
[9]	test's multi_logloss: 0.313256
[10]	test's multi_logloss: 0.276926
[11]	test's multi_logloss: 0.247315
[12]	test's multi_logloss: 0.221442
[13]	test's multi_logloss: 0.199252
[14]	test's multi_logloss: 0.177485
[15]	test's multi_logloss: 0.160641
[16]	test's multi_logloss: 0.144921
[17]	test's multi_logloss: 0.
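
The three "Start training from score" values in the log above appear consistent with the natural log of each class's share of the training set. The short check below assumes class counts of 40, 41 and 39 in the 120-row training split; these counts are inferred from the logged scores and are not printed by the notebook.

```python
import math

# Assumed class counts in the training split (inferred, not shown in the notebook)
class_counts = [40, 41, 39]
n_train = sum(class_counts)  # 120, matching the number of training data points in the log

# Log of the class prior for each class
init_scores = [math.log(count / n_train) for count in class_counts]

for class_id, score in enumerate(init_scores):
    print(f"class {class_id}: {score:.6f}")
```

Under this assumption the model starts from the class frequencies alone, and each boosting round then lowers the `multi_logloss` from that baseline.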



In [10]:
# log this notebook as an artifact of the run
mlflow.log_artifact("PySDK.ipynb")

In [11]:
# end run and mark it as complete
mlflow.end_run()