# Supervised Learning in Machine Learning - Classification

In this notebook you will get familiar with some classification algorithms using the PyCaret Python package and the preprocessed "Pima Indians Diabetes Database" dataset.

**Classification:**
This is a binary classification problem in which the aim is to predict which patients have diabetes based on a number of measurements.




**Please create a report by addressing the provided questions (Q1-Q4) throughout the notebook.**




In [None]:
from pycaret.classification import *
from pycaret.utils.generic import check_metric

# for visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd

In [None]:
# load the preprocessed data

url = "https://raw.githubusercontent.com/thilinib/CBM101/main/E_Macine_Learning/data/preprocessed_diabetes.csv"
df = pd.read_csv(url)

In [None]:
# check the shape of the preprocessed data
df.shape

In [None]:
# check what the preprocessed data looks like
df.head()

In [None]:
# Calculating the relative size of each class
N_TRUE = len(df[df['Outcome'] == 1])
N_FALSE = len(df) - N_TRUE

print('N_TRUE = {}'.format(N_TRUE))
print('N_FALSE = {}'.format(N_FALSE))
print('N_FALSE fraction = {:.3f}'.format(N_FALSE/(N_FALSE+N_TRUE)))

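About 67% of the examples do not have diabetes. A classifier that always predicts the majority class would therefore reach about 67% accuracy, which serves as our baseline for the accuracy score of the classifiers.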
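The baseline idea can be made concrete with a minimal sketch. The labels below are hypothetical toy values, not the real dataset; the point is only that a "classifier" which always predicts the majority class scores an accuracy equal to the majority-class fraction.

```python
from collections import Counter

# Hypothetical toy labels (1 = diabetes, 0 = no diabetes); not the real dataset.
y = [0, 0, 1, 0, 1, 0, 0, 1, 0]

# A "classifier" that ignores the features and always predicts the most common class.
majority_class, majority_count = Counter(y).most_common(1)[0]
predictions = [majority_class] * len(y)

# Accuracy = fraction of predictions that match the true labels.
baseline_accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"majority class = {majority_class}, baseline accuracy = {baseline_accuracy:.3f}")
```

Any trained model should beat this number to be considered useful.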

# Splitting data into a Training set and a Test set
PyCaret will do the splitting automatically. If you use scikit-learn instead, you need to do this yourself, e.g. using scikit-learn's `train_test_split()` function.
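To illustrate what such a split does, here is a simplified sketch of a stratified hold-out split written with the standard library only. It is not the implementation used by PyCaret or scikit-learn; the function name `stratified_split`, the toy labels and the 30% test fraction are assumptions chosen for illustration.

```python
import random

def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_indices, test_indices) with similar class proportions in both."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 20 negatives and 10 positives.
labels = [0] * 20 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training samples,", len(test_idx), "test samples")
```

Splitting class by class keeps the share of positive cases roughly equal in the training and test sets, which matters for a small, imbalanced dataset.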

# Setting up Environment in PyCaret

# Classification in PyCaret

`setup()` is PyCaret's main function and it needs to be run before executing any other function in PyCaret. The `setup()` function initializes the environment and creates the transformation pipeline that prepares the data for modeling and deployment.

We'll set `normalize` and `transformation` to *True* for automatic preprocessing.

In [None]:
# `session_id` parameter is equivalent to ‘random_state’ in scikit-learn. Let's use 42 for reproducibility.
s = setup(df, target='Outcome', normalize = True, transformation=True, session_id=42)

Once the setup has been successfully executed, it prints an information grid containing several important pieces of information. Most of it relates to the pre-processing pipeline that is constructed when `setup()` is executed. The majority of these features are out of scope for this tutorial; however, a few important things to note at this stage include:

**session_id** : A pseudo-random number distributed as a seed in all functions for later reproducibility. If no session_id is passed, a random number is automatically generated and distributed to all functions. In this experiment, the session_id is set to 42.

**Transformed train and test set shapes**: Here you can see that PyCaret has performed the train-test split automatically.

**Transformation method** : By default, the transformation method is set to ‘yeo-johnson’. The other available option is ‘quantile’. It can be changed using the *transformation_method* parameter.

**Normalize method** : By default, the normalize method is set to ‘zscore’. Other available options include 'minmax', 'maxabs' and 'robust'. It can be changed using the *normalize_method* parameter.
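As a rough illustration of what the 'zscore' and 'minmax' normalization options compute, the sketch below applies both to a single hypothetical feature. The values and function names are made up for illustration, and this is not PyCaret's internal code.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def minmax(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical glucose-like measurements.
glucose = [85.0, 100.0, 115.0, 130.0, 170.0]
print("zscore:", [round(v, 2) for v in zscore(glucose)])
print("minmax:", [round(v, 2) for v in minmax(glucose)])
```

Both options put features on a comparable scale, which helps scale-sensitive models such as SVM and logistic regression.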

In [None]:
# list all available configuration variables
get_config()

In [None]:
get_config('X_train')

In [None]:
get_config('X_test')

# Compare models

In [None]:
# list all ML models
models()

PyCaret can train all the available ML algorithms using default parameters. We can compare all models using `compare_models()`, which ranks them from best to worst.

This gives us lots of metrics we can use to evaluate the results:


**Accuracy** = $ \frac{Correctly\:predicted}{Total\:samples}$  <br>


**Precision** = $ \frac{True\:positive}{Total\:predicted\:positive}$ <br>


**Recall** = $ \frac{True\:positive}{Total\:actually\:positive}$ <br>


**F1** = $ 2* \frac{Precision*Recall}{Precision+Recall}$ <br>
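The four formulas above can be evaluated directly from the counts in a confusion matrix. The sketch below uses hypothetical counts (`tp`, `fp`, `fn`, `tn` are made-up numbers, and `classification_metrics` is an illustrative helper, not a PyCaret function).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # correctly predicted / total samples
    precision = tp / (tp + fp)                   # true positives / predicted positives
    recall = tp / (tp + fn)                      # true positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1": f1}

# Hypothetical counts: 30 true positives, 10 false positives,
# 20 false negatives and 60 true negatives.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy and precision are both 0.75 while recall is only 0.6, which shows why accuracy alone can hide missed positive cases.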



In [None]:
# best model is saved in best_model object
best_model = compare_models()

The score grid printed above highlights the highest performing metric for comparison purposes only. By default the grid is sorted by 'Accuracy' (highest to lowest), which can be changed by passing the `sort` parameter. For example, `compare_models(sort = 'Recall')` will sort the grid by Recall instead of Accuracy. To change the number of folds from the default value of 10, use the `fold` parameter. For example, `compare_models(fold = 5)` will compare all models using 5-fold cross validation. Reducing the number of folds will reduce the training time.

Let's create a few models using 10-fold stratified **cross validation**. You can change the number of folds using the `fold` parameter.
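As a simplified sketch of the idea behind stratified k-fold cross validation, the code below assigns sample indices to folds class by class, so that every fold has a similar class balance. The helper `stratified_folds` and the toy labels are assumptions for illustration; PyCaret relies on its own cross-validation machinery.

```python
import random

def stratified_folds(labels, n_folds=10, seed=42):
    """Assign every sample index to one of n_folds folds, class by class."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        # Deal the indices out round-robin so each fold gets a similar share of this class.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Hypothetical labels: 40 negatives and 20 positives.
labels = [0] * 40 + [1] * 20
folds = stratified_folds(labels, n_folds=10)
for k, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {k}: {len(fold)} validation samples, {positives} positive")
```

In each round one fold is held out for validation and the model is trained on the remaining folds; the reported score is the average over all rounds.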

# Create a Model

For the remaining part of this tutorial, we will work with the below models as our candidate models.

In [None]:
create_svm = create_model('svm', round=3)


The trained model object is stored in the variable `create_svm`.


In [None]:
print(create_svm)

In [None]:
create_lr = create_model('lr', round=3)


In [None]:
create_gbc = create_model('gbc', round=3)

Notice that the mean score of each model matches the score printed in `compare_models()`. This is because the metrics printed in the `compare_models()` score grid are the average scores across all CV folds. As with `compare_models()`, you can change the number of folds from the default of 10 using the `fold` parameter. For example, `create_model('dt', fold = 5)` will create a Decision Tree Classifier using 5-fold stratified CV.

# Tune a Model

When a model is created using the `create_model()` function, it is trained with the default hyperparameters. To tune hyperparameters, use the `tune_model()` function, which performs a random grid search over a pre-defined search space. By default it optimizes Accuracy, but this can be changed with the `optimize` parameter. For example, `tune_model(create_lr, optimize = 'AUC')` will search for the hyperparameters of the logistic regression classifier that result in the highest AUC. In this example we use the default metric, Accuracy, for the sake of simplicity.

The number of iterations of the random search is set with the `n_iter` parameter, which defaults to 10.
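The following is a minimal sketch of how a random search over a pre-defined search space works in principle. It is not PyCaret's implementation: `random_search`, the search space and `toy_score` (a stand-in for a cross-validated accuracy) are all hypothetical.

```python
import random

def random_search(score_fn, search_space, n_iter=10, seed=42):
    """Try n_iter random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value for every hyperparameter from the pre-defined space.
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical search space for a linear model.
search_space = {
    "alpha": [0.0001, 0.001, 0.01, 0.1, 1.0],
    "penalty": ["l1", "l2", "elasticnet"],
}

def toy_score(params):
    """Toy stand-in for a cross-validated accuracy; favors alpha=0.01 and 'l2'."""
    score = 0.75
    score += 0.02 if params["alpha"] == 0.01 else 0.0
    score += 0.01 if params["penalty"] == "l2" else 0.0
    return score

best_params, best_score = random_search(toy_score, search_space, n_iter=10)
print(best_params, round(best_score, 3))
```

Because only `n_iter` combinations are sampled, the search may miss the best setting; raising `n_iter` explores more of the space at the cost of longer run time.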

In [None]:
tuned_svm = tune_model(create_svm, round=3)

In [None]:
# tuned model object is stored in the variable 'tuned_svm'.
print(tuned_svm)

**Q1**



> Check which hyperparameters changed during tuning of the model. How can you pass a custom grid for tuning ([check this](https://pycaret.gitbook.io/docs/get-started/functions/optimize#passing-custom-grid))?

In [None]:
tuned_lr = tune_model(create_lr, round=3)

In [None]:
tuned_gbc = tune_model(create_gbc, round=3)

Notice how the accuracy has changed after tuning:

**SVM:** from `0.749` to `0.778` <br>
**LR:** from `0.767` to `0.785`  <br>



# Plot a Model
Before model finalization, the plot_model() function can be used to analyze the performance across different aspects such as AUC, confusion_matrix, decision boundary etc. This function takes a trained model object and returns a plot based on the test / hold-out set.

There are 15 different plots available.

In [None]:
# Precision-Recall curve
plot_model(tuned_svm, plot = 'pr')


In [None]:
plot_model(tuned_svm, plot='feature')



> **Q2 :** What does feature importance mean? What are the most important features according to the feature plot? Plot the three most important variables with a seaborn pairplot using 'Outcome' as hue and see if you can notice any correlation.





In [None]:
plot_model(tuned_svm, plot = 'confusion_matrix')


`plot_model(tuned_svm, plot = 'auc')` will return an error because this SVM model does not provide probability estimates directly, and the AUC plot requires probability estimates to calculate the AUC score.
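To see why AUC depends on continuous scores rather than hard class labels, here is a small sketch that computes AUC from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The data and the `roc_auc` helper are hypothetical, and this is not how PyCaret computes the metric internally.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Hypothetical true labels and probability estimates.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# Hard labels obtained by thresholding at 0.5 throw away the ranking information.
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

print(f"AUC from scores:      {roc_auc(y_true, scores):.3f}")
print(f"AUC from hard labels: {roc_auc(y_true, hard_labels):.3f}")
```

With only hard labels many pairs are tied, so the result reflects a single threshold instead of the full ROC curve.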

In [None]:
plot_model(tuned_lr, plot = 'auc')

In [None]:
plot_model(tuned_lr, plot='feature')

In [None]:
plot_model(tuned_lr, plot = 'confusion_matrix')


Another way to analyze the performance of models is to use the evaluate_model() function which displays a user interface for all of the available plots for a given model. It internally uses the plot_model() function.

In [None]:
evaluate_model(tuned_gbc)


> **Q3:** Explain the plots you have got from the three models

# Predict on test / hold-out Sample

Before finalizing the model, it is advisable to perform one final check by predicting the test/hold-out set and reviewing the evaluation metrics. If you look at the information grid after running `setup()`, you will see that 30% (119 samples) of the data has been set aside as the test/hold-out sample. All of the evaluation metrics we have seen above are cross-validated results based on the training set (70%) only. Now, using the tuned model, we will predict against the hold-out sample and check whether the metrics are materially different from the cross-validated results.

When data is None (default), it uses the test set (created during the setup function) for scoring.

In [None]:
predict_model(tuned_svm);


The accuracy on the test set is `0.806`, compared to the cross-validated accuracy of `0.778` obtained on the training set. The dataset is quite small, so this difference is not significant. A large gap between test and training results might indicate over-fitting (when the training score is much higher than the test score), but it could also be due to several other factors and would require further investigation. In this case the test accuracy is higher than the cross-validated accuracy, so we will move forward with finalizing the model.
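A rough back-of-the-envelope check supports the claim that this difference is small for a dataset of this size. The sketch below assumes a simple binomial approximation for the standard error of an accuracy measured on the 119 hold-out samples; that approximation is an assumption made here, not something PyCaret reports.

```python
import math

n_test = 119           # size of the hold-out set reported by setup()
test_accuracy = 0.806  # accuracy on the hold-out set
cv_accuracy = 0.778    # cross-validated accuracy on the training set

# Binomial approximation: standard error of a proportion measured on n_test samples.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)
difference = test_accuracy - cv_accuracy

print(f"difference     = {difference:.3f}")
print(f"standard error = {standard_error:.3f}")
```

The difference between the two accuracies is smaller than one standard error of the hold-out estimate, so it is well within the expected sampling noise.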




# Finalize model

Model finalization is the last step in the experiment. The purpose of this function is to train the model on the complete dataset including test data, before it is deployed in production.

This function doesn't change any parameter of the model. It only refits on the entire dataset including the hold-out set.

In [None]:
final_svm = finalize_model(tuned_svm)
final_svm

# Predict on unseen data

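Now that we have a fully trained model, we could start using it on new data. Because we used all our data for training and have no new data to test the model, we can only demonstrate prediction on the same data. Metrics computed on data the model was trained on tend to be optimistic and should not be read as an estimate of performance on new patients.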

In [None]:
predictions = predict_model(final_svm, data=df) # pass the model and unseen-data as parameters
predictions.head()

The `prediction_label` column (and, for models that provide probability estimates, the `prediction_score` column) is added to the dataframe. The label is the predicted class and the score is the probability of the prediction. Notice that the predicted results are concatenated to the original dataset while all the transformations are performed automatically in the background. You can also check the metrics using `check_metric` from the `pycaret.utils.generic` module. You can do this easily with basic Python too, but this is a simple way to check other metrics (such as recall) as well. See the example below:

In [None]:

# compare target and predicted labels
print("Prediction accuracy", check_metric(df['Outcome'], predictions['prediction_label'], metric = 'Accuracy'))
print("Prediction recall", check_metric(df['Outcome'], predictions['prediction_label'], metric = 'Recall'))

# Saving the model

We have now finished the experiment by finalizing the `tuned_svm` model, which is now stored in the `final_svm` variable. We have also used the model stored in `final_svm` to make predictions on `df`. This brings us to the end of our experiment, but one question remains: what happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no. PyCaret's built-in function `save_model()` allows you to save the model along with the entire transformation pipeline for later use.

In [None]:
save_model(final_svm,'Final svm Model')


# Load the saved model

To load a saved model at a future date in the same or an alternative environment, we would use PyCaret's load_model() function and then easily apply the saved model on new unseen data for prediction.

In [None]:
saved_final_svm = load_model('Final svm Model')


Once the model is loaded in the environment, you can simply use it to predict on any new data using the same predict_model() function. Below we have applied the loaded model to predict the same dataset.

In [None]:
new_prediction = predict_model(saved_final_svm, data=df)
new_prediction


> **Q4 :**
What do you think about the accuracy scores from the trained model, tuned model and finalized model? What can we do to increase the accuracy scores?