# Optimizing an ML Pipeline in Azure

-----------------------------------------------------------------------------------------------------------------------------------------------------------
## Overview

This project is part of the Udacity Azure ML Nanodegree.
In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model.
This model is then compared to an Azure AutoML run.

-----------------------------------------------------------------------------------------------------------------------------------------------------------

## Summary
### **In 1-2 sentences, explain the problem statement:**

This dataset contains bank marketing data, including age, job, marital status, education, housing, loan, poutcome, etc. It is a classification problem where the goal is to predict whether a client will subscribe to a term deposit.

### **In 1-2 sentences, explain the solution:**

For the data preprocessing part, all the categorical (text) feature columns are encoded into numerical form.

For the ML training part, the best performing model was Logistic Regression, with an accuracy of 0.911.

Specifically, the hyperparameters of this best model were:
- C (inverse regularization strength): 0.2135
- max_iter (maximum iterations): 50

-----------------------------------------------------------------------------------------------------------------------------------------------------------

## Scikit-learn Pipeline

### **Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.**

The ML SDK pipeline architecture includes:

- The dataset used for training was obtained as an Azure dataset from a provided URL.
- The dataset was then converted to a pandas dataframe, cleaned by dropping any rows with missing data, and one-hot encoded. Finally, the dataframe was split into training and testing datasets.
- The classification algorithm used in this study was Logistic Regression from sklearn. Logistic Regression is a go-to binary classification algorithm with the following benefits: 1) it is easy to use and interpret, 2) it is very efficient to train, and 3) it makes no assumptions about the distributions of classes in feature space.
- Hyperparameter tuning was performed using HyperDriveConfig with the specified training script and hyperparameters. For Logistic Regression, the C parameter is the inverse of the regularization strength (a smaller value means a stronger penalty and therefore stronger regularization), and the max_iter parameter is the maximum number of iterations allowed for the solver to converge. In this case, C was sampled from a uniform distribution over 0.1 to 1, and max_iter was chosen from the discrete values 50, 100, 150, and 200 (100 is the default).
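The data preparation steps in the list above (drop missing rows, one-hot encode, split) can be sketched roughly as follows. This is a standard-library illustration only, not the project's pandas-based cleaning code: the toy rows, the helper names (`drop_missing`, `one_hot`, `train_test_split`), the 25% test fraction, and the seed are all assumptions made for the example.

```python
import random

# Toy rows invented for illustration; the real dataset has many more columns.
rows = [
    {"age": 30, "marital": "married", "y": "no"},
    {"age": 45, "marital": "single", "y": "yes"},
    {"age": None, "marital": "divorced", "y": "no"},  # missing value, will be dropped
    {"age": 52, "marital": "married", "y": "no"},
    {"age": 28, "marital": "single", "y": "yes"},
]

def drop_missing(records):
    """Keep only the records that have no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

def one_hot(records, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new_row)
    return encoded

def train_test_split(records, test_fraction, seed):
    """Shuffle a copy of the records and split off a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

clean = drop_missing(rows)
encoded = one_hot(clean, "marital")
train, test = train_test_split(encoded, test_fraction=0.25, seed=0)

print(encoded[0])
print(len(train), "training rows,", len(test), "test rows")
```

Note that categories are derived from the cleaned rows, so a category that only appears in dropped rows (here `divorced`) gets no column.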

### **What are the benefits of the parameter sampler you chose?**

The parameter sampler used here was RandomParameterSampling. It supports both discrete and continuous hyperparameters, and it also supports early termination of low-performing runs.
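To illustrate what random sampling over this search space looks like, here is a minimal standard-library sketch that draws a few configurations with C uniform in [0.1, 1] and max_iter chosen from 50, 100, 150, 200. It is not the Azure ML implementation; `sample_config`, the seed, and the number of draws are hypothetical choices for the example.

```python
import random

# Search space as described in this README.
C_RANGE = (0.1, 1.0)                     # continuous, sampled uniformly
MAX_ITER_CHOICES = (50, 100, 150, 200)   # discrete choices

def sample_config(rng):
    """Draw one hyperparameter configuration at random (hypothetical helper)."""
    return {
        "C": rng.uniform(*C_RANGE),
        "max_iter": rng.choice(MAX_ITER_CHOICES),
    }

rng = random.Random(42)  # arbitrary seed, assumed here only for reproducibility
configs = [sample_config(rng) for _ in range(5)]

for cfg in configs:
    print(cfg)
```

Because each draw is independent, the runs can be launched in parallel and the budget can be changed without redesigning a grid.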

### **What are the benefits of the early stopping policy you chose?**

The early stopping policy chosen here was BanditPolicy, which is based on a slack factor and an evaluation interval. Bandit terminates runs whose primary metric falls outside the specified slack factor relative to the best-performing run, which helps reduce the total training time. A smaller slack factor gives more aggressive savings.
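The slack-factor rule can be sketched as below for a metric that is maximized, such as accuracy: a run is stopped when its metric is below the best metric divided by (1 + slack factor). The slack factor of 0.1 and the candidate accuracies are hypothetical, since this README does not state the values used, and the sketch ignores the evaluation interval and any delay before the first evaluation.

```python
def bandit_should_terminate(metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack relative to the best run.

    Assumes the primary metric is being maximized (for example, accuracy).
    """
    threshold = best_metric / (1.0 + slack_factor)
    return metric < threshold

best_accuracy = 0.911  # best accuracy reported in this README
slack_factor = 0.1     # hypothetical value, not taken from the project

# Hypothetical accuracies of other runs at the same evaluation interval.
decisions = {
    acc: bandit_should_terminate(acc, best_accuracy, slack_factor)
    for acc in (0.905, 0.85, 0.80)
}

for acc, stop in decisions.items():
    print(f"accuracy={acc}: terminate={stop}")
```

With these assumed numbers the cutoff is 0.911 / 1.1 ≈ 0.828, so only the run at 0.80 would be stopped.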


## AutoML
### **In 1-2 sentences, describe the model and hyperparameters generated by AutoML.**
With the metric "accuracy", the best classification model generated by AutoML was LightGBMClassifier (MaxAbsScaler) with an accuracy of 0.9148. Some other metrics include AUC weighted (0.94635) and F1 score weighted (0.91162). 

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It offers several advantages: 1) faster training speed with higher efficiency; 2) lower memory usage; 3) better accuracy; 4) support for parallel and GPU learning; and 5) the ability to handle large-scale data.

Some of the hyperparameters of this best model were:

- boosting_type: 'gbdt'
- learning_rate: 0.1
- min_child_samples: 20
- n_estimators: 100
- num_leaves: 31

In addition, based on the best model's explanations, the feature importance ranking was "duration > nr.employed > cons.price.idx > euribor3m > emp.var.rate > cons.price.idx > age > poutcome".

## Pipeline comparison
### **Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?**

#### Performance Comparison:

- For SDK pipeline "accuracy": 0.911
- For AutoML "accuracy": 0.9148

The performance of the two is quite similar, with AutoML a bit better than the SDK pipeline.

#### Architecture Comparison:

- SDK HyperDrive hyperparameter tuning requires the user to specify details such as the sampling method, early termination policy, primary metric name, primary metric goal, and other hyperparameters. In return, the HyperDrive tuning process gives the user more control over the training algorithm and other details.
- AutoML, on the other hand, needs few detailed parameters, generally just two: the task (regression, classification, or forecasting) and the primary metric. AutoML checks for missing and imbalanced data, performs feature engineering and cross-validation (n_cross_validations), and provides the user with corresponding recommendations. It therefore does all the heavy lifting under the hood, and everything can be automated. In addition, one great benefit of AutoML is model explanation, which the user can leverage to gain insight into global feature importance.

## Future work

### **What are some areas of improvement for future experiments? Why might these improvements help the model?**
- To perform feature engineering, for example, dimensionality reduction using PCA.
PCA represents multivariate (high-dimensional) data as a smaller set of variables (lower dimension) in order to observe trends, clusters, and outliers. This could uncover the relationships between observations and variables, and among the variables themselves.

- To fix data imbalance.
The dataset is highly imbalanced: only about 2473 / (2473 + 22076) * 100 = 10.1 % of the clients actually subscribed to a term deposit. Imbalanced data can make a model's accuracy look better than it really is, because the input data is biased towards one class. Fixing the data imbalance should therefore be considered in future experiments. We may try: 1) random undersampling, 2) oversampling with SMOTE, or 3) a combination of under- and oversampling in a pipeline.

- To try other metrics such as 'AUC_weighted' to get better training.
For a highly imbalanced problem like this one, the AUC metric is very popular. AUC is often preferred over accuracy for imbalanced binary classification, so this metric is worth a try.

- Specifically for the SDK pipeline tuning, consider adding cross-validation to model training, to get a more reliable accuracy estimate and guard against overfitting to a single train/test split.
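The imbalance and metric points in the list above can be made concrete with a short calculation. The sketch below first computes, from the class counts quoted in this README, the accuracy of a trivial model that always predicts "no", and then compares accuracy with AUC on a tiny made-up example. The helper functions and the toy labels and scores are assumptions for illustration, and the AUC here is the plain binary AUC rather than Azure's 'AUC_weighted'.

```python
# Class counts quoted in this README.
subscribed = 2473
not_subscribed = 22076
total = subscribed + not_subscribed

positive_rate = subscribed / total
baseline_accuracy = not_subscribed / total  # accuracy of always predicting "no"

def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions when scores >= threshold mean class 1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data, invented for illustration: 1 positive out of 10 samples.
labels = [1] + [0] * 9
always_no = [0.1] * 10       # same low score for everyone
ranked = [0.4] + [0.1] * 9   # ranks the positive highest, but still below 0.5

print(f"positive rate:     {positive_rate:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print("always_no:", accuracy(labels, always_no), roc_auc(labels, always_no))
print("ranked:   ", accuracy(labels, ranked), roc_auc(labels, ranked))
```

Assuming the evaluation data has a similar class ratio, the always-"no" baseline reaches roughly 0.899 accuracy, so the reported 0.911 and 0.9148 sit only one to two percentage points above it. In the toy example both score sets have the same accuracy, while AUC separates the uninformative one (0.5) from the one that ranks the positive first (1.0).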


## Proof of cluster clean up
### **If you did not delete your compute cluster in the code, please complete this section. Otherwise, delete this section.**
### **Image of cluster marked for deletion**



![cluster_deleted](cluster-deleting-screenshot.png)