In this project, I created and optimized an ML pipeline.
I trained a custom-coded model using standard Scikit-learn Logistic Regression, and optimized its hyperparameters using HyperDrive.
I also used AutoML to build and optimize a model on the same dataset, so that the results of the two methods can be compared.
I worked on three different use cases, for which I'm attaching images of the results.
These models are then compared to an Azure AutoML run.
- ScriptRunConfig Class
- Configure and submit training runs
- HyperDriveConfig Class
- How to tune hyperparameters
For a given set of banking data, predict whether a customer will be interested in taking a loan. The parameters used for loan prediction are numeric, string, and boolean fields such as age, job, marital, education, loan, and default.
As described, I ran three use cases: the first using HyperDrive and two using the AutoML SDK.
My conda Jupyter notebook environment details: Python 3.8.5, azureml 1.41.0, sklearn 1.0.2.
I used an Optum Azure account to run all my experiments and did not use the Udacity Azure labs.
a) Using Scikit-Learn with HyperDrive hyperparameter tuning:
Pipeline Architecture:
1. Connect to your workspace using the config file.
2. Create an experiment within the given workspace.
3. Add a compute target.
4. Specify the parameter sampling policy.
5. Specify the early termination policy.
6. Set up the sklearn environment.
7. Specify the ScriptRunConfig.
8. Specify the HyperDrive config.
9. Submit the job to run.
10. Gather performance and accuracy results.
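The workspace, environment, and script-run steps above can be sketched with the azureml-core SDK. This is a sketch, not the exact notebook code: the experiment name, conda file name, training script name, and cluster name are assumed placeholders, and running it requires an Azure workspace and a local config.json.

```python
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

# Steps 1-2: connect to the workspace (reads config.json) and create an experiment
ws = Workspace.from_config()
exp = Experiment(workspace=ws, name="udacity-project")  # placeholder name

# Step 6: sklearn environment from a conda specification file (assumed file name)
sklearn_env = Environment.from_conda_specification(
    name="sklearn-env", file_path="conda_dependencies.yml")

# Step 7: script run config pointing at the training script (assumed name)
src = ScriptRunConfig(source_directory=".", script="train.py",
                      compute_target="cpu-cluster", environment=sklearn_env)

# Step 9: submit the job (a HyperDrive run is submitted the same way,
# passing a HyperDriveConfig instead of the plain ScriptRunConfig)
run = exp.submit(src)
run.wait_for_completion(show_output=True)
```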
Details of the input parameters and results of this experiment (the model is Logistic Regression):
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state=0)
Sampling policy: RANDOM and Parameter space {"--C":["uniform",[0.1,0.4]],"--max_iter":["choice",[[50,100,200,250]]]}
Early termination policy: BANDIT with Properties {"evaluation_interval":1,"delay_evaluation":5,"slack_factor":0.2}
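The sampling and early-termination settings above correspond to the following azureml HyperDrive sketch. `src` stands for the ScriptRunConfig from the earlier steps, and `max_total_runs` is an assumed value, not taken from the actual run.

```python
from azureml.train.hyperdrive import (
    RandomParameterSampling, BanditPolicy, HyperDriveConfig,
    PrimaryMetricGoal, uniform, choice)

# Parameter space matching the values above: C uniform in [0.1, 0.4],
# max_iter chosen from a discrete set
param_sampling = RandomParameterSampling({
    "--C": uniform(0.1, 0.4),
    "--max_iter": choice(50, 100, 200, 250),
})

# Bandit early-termination policy with the properties listed above
policy = BanditPolicy(evaluation_interval=1, delay_evaluation=5,
                      slack_factor=0.2)

hyperdrive_config = HyperDriveConfig(
    run_config=src,                     # ScriptRunConfig from the setup steps
    hyperparameter_sampling=param_sampling,
    policy=policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20)                  # assumed run budget
```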
Results: Accuracy: 0.916 at Max iterations: 200 and Regularization Strength: 0.155
Benefits of the selected parameter
A lower regularization strength (C) and a mid-range number of iterations resulted in better accuracy.
Benefits of the early stopping policy
Any run whose evaluation metric falls outside the slack factor (or slack amount) with respect to the best-performing run is terminated.
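As a concrete example of the slack_factor arithmetic (plain Python, not SDK code): for a maximized metric, the Bandit policy keeps a run only if its metric is at least best / (1 + slack_factor). With a best accuracy of 0.916 and slack_factor 0.2, the cutoff is about 0.763.

```python
def bandit_cutoff(best_metric: float, slack_factor: float) -> float:
    """Minimum metric a run must reach to survive a Bandit policy
    when the metric is maximized: best / (1 + slack_factor)."""
    return best_metric / (1 + slack_factor)

cutoff = bandit_cutoff(0.916, 0.2)
print(round(cutoff, 3))  # 0.763
```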
Results: Accuracy: 0.918 with VotingEnsemble as best model selected
In this case I initially forgot to add a validation_size (test data split) to AutoMLConfig, but this run still gave the best performance overall.
In this case I set validation_size to 25% in AutoMLConfig. The results for the best-fit model, VotingEnsemble, were the same.
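The second AutoML run can be sketched as below. This is an assumption-laden sketch: `ds` stands for the registered banking dataset, the label column name "y", the timeout, and the compute target name are placeholders, not values taken from the actual notebook.

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=ds,               # registered banking dataset (placeholder)
    label_column_name="y",          # assumed label column name
    validation_size=0.25,           # 25% holdout, as described above
    experiment_timeout_minutes=30,  # assumed timeout
    compute_target="cpu-cluster")   # placeholder cluster name
```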
Results: Accuracy: 0.918 with VotingEnsemble as best model selected
The AutoML SDK pipeline resulted in slightly better performance, 0.918 accuracy, compared to 0.916 for the Scikit-learn HyperDrive run.
AutoML performed better due to the number of iterations it could run and tune. The VotingEnsemble model gave better accuracy; this wasn't an option for HyperDrive.
A lower regularization strength (C) and a mid-range number of iterations resulted in better accuracy for HyperDrive.
Given the processing cost and time overhead of AutoML, I would prefer the HyperDrive option, as the accuracy in both cases is almost the same.
I would like to run more use cases and deep dive into NLP and recommendation systems.
I used an Optum Azure account for all my lab work. Compute resources are stopped or scaled to 0 nodes for all clusters.
Infrastructure: A compute cluster is created using the Azure SDK and the ComputeTarget and AmlCompute objects.
Created Compute Cluster using AmlCompute and ComputeTarget objects
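A sketch of the cluster provisioning (the cluster name, VM size, and node counts are placeholders; `ws` is the workspace object from the earlier setup steps):

```python
from azureml.core.compute import ComputeTarget, AmlCompute

# Provision a CPU cluster; min_nodes=0 lets it scale to zero when idle
compute_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_D2_V2",  # placeholder VM size
    min_nodes=0,
    max_nodes=4)
cpu_cluster = ComputeTarget.create(ws, "cpu-cluster", compute_config)
cpu_cluster.wait_for_completion(show_output=True)
```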
The delete method of the AmlCompute object is used to remove the cluster following training.
cpu_cluster.delete()
Cluster Delete
https://azure.github.io/azureml-sdk-for-r/reference/bandit_policy.html