<a href="https://colab.research.google.com/github/saeid-uot/UoT_MHI/blob/main/MHI_Predicting_Diabetic_PyCaret.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NOTE: To **save** this notebook to your Google Drive, open the "File" menu in the top-left corner of the notebook and select "**Save a Copy in Drive**".

After saving, you can find the notebook in Google Drive under the "Colab Notebooks" folder.

# Objective:
In this notebook, you will learn how to install the PyCaret Python library and how to train and evaluate a machine learning model with it.

For this notebook, we use the **Pima Indians Diabetes** Dataset. It originates from the National Institute of Diabetes and Digestive and Kidney Diseases and contains information on **768 women** from a population near Phoenix, Arizona, USA.

The outcome tested was diabetes:

*   268 tested positive
*   500 tested negative

The dataset therefore has one target (dependent) variable, the test outcome, and 8 attributes (TYNECKI, 2018):
* pregnancies
* OGTT (Oral Glucose Tolerance Test)
* blood pressure
* skin thickness
* insulin
* BMI (Body Mass Index)
* age
* diabetes pedigree function

The Pima population has been under study by the National Institute of Diabetes and Digestive and Kidney Diseases at intervals of 2 years since 1965. Because epidemiological evidence indicates that type 2 diabetes mellitus (T2DM) results from the interaction of genetic and environmental factors, the Pima Indians Diabetes Dataset includes attributes that are expected to be related to the onset of diabetes and its future complications.

An exploratory analysis of the dataset can be found here: https://www.kaggle.com/code/gauravduttakiit/diabetes-prediction-with-pycaret/data

# Installing PyCaret

In [None]:
# Important note: if this cell raises an error, please rerun it.
!pip install pycaret

Collecting pycaret
  Downloading pycaret-3.2.0-py3-none-any.whl (484 kB)
Collecting category-encoders>=2.4.0 (from pycaret)
  Downloading category_encoders-2.6.3-py2.py3-none-any.whl (81 kB)
Collecting deprecation>=2.1.0 (from pycaret)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting kaleido>=0.2.1 (from pycaret)
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
Collecting matplotlib<=3.6,>=3.3.0 (from pycaret)
  Downloading matplotlib-3.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.8 MB)

# Step 1: Load Dataset

In [None]:
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
diabetes.describe()

Unnamed: 0,Number of times pregnant,Plasma glucose concentration a 2 hours in an oral glucose tolerance test,Diastolic blood pressure (mm Hg),Triceps skin fold thickness (mm),2-Hour serum insulin (mu U/ml),Body mass index (weight in kg/(height in m)^2),Diabetes pedigree function,Age (years),Class variable
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
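
Because `Class variable` only holds 0 and 1, its mean in the summary above is the fraction of women who tested positive. The short sketch below recovers the class counts from the two numbers printed by `describe()`; the variable names are illustrative and nothing here calls PyCaret.

```python
# Values copied from the describe() output above.
n_rows = 768                # 'count' row
positive_rate = 0.348958    # 'mean' of 'Class variable' (0 = negative, 1 = positive)

# The mean of a 0/1 column is the share of 1s, so count * mean gives the positives.
n_positive = round(n_rows * positive_rate)
n_negative = n_rows - n_positive

print(n_positive, n_negative)
print(f"{positive_rate:.1%} of the women tested positive")
```

The classes are imbalanced (roughly one positive for every two negatives), which is one reason to look at recall and F1 alongside accuracy in the steps that follow.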


# Step 2: Initialize/Set Up PyCaret

In [None]:
# init setup
# Important note: remember to press Enter in the blue text box that appears in the output below.
# The code will not continue until you press Enter.
from pycaret.classification import *
pycaret_setup = setup(data = diabetes, target = 'Class variable', numeric_features=['Number of times pregnant'])

Unnamed: 0,Description,Value
0,Session id,938
1,Target,Class variable
2,Target type,Binary
3,Original data shape,"(768, 9)"
4,Transformed data shape,"(768, 9)"
5,Transformed train set shape,"(537, 9)"
6,Transformed test set shape,"(231, 9)"
7,Numeric features,1
8,Preprocess,True
9,Imputation type,simple
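
The train and test set shapes reported by `setup` follow from the split fraction. The sketch below assumes PyCaret's default `train_size` of 0.7, since the call above does not set it explicitly. It only reproduces the row counts and does not call PyCaret.

```python
n_rows = 768        # rows in the original data, from the setup summary
train_size = 0.7    # assumption: the default, because setup() was called without train_size

n_train = int(n_rows * train_size)   # whole rows only: 537.6 is truncated to 537
n_test = n_rows - n_train            # the remaining rows form the hold-out test set

print(n_train, n_test)
```

The model is trained and cross-validated on the 537 training rows; the 231 test rows are held back for a final check.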


# Step 3: Create Model!
Let's create a Logistic Regression model using the keyword 'lr'.

In [None]:
# train a logistic regression model with cross-validation
mymodel = create_model('lr')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.7778,0.8602,0.5789,0.7333,0.6471,0.4882,0.4954
1,0.7593,0.9158,0.7368,0.6364,0.6829,0.4906,0.494
2,0.7593,0.8346,0.6316,0.6667,0.6486,0.4658,0.4661
3,0.7963,0.8481,0.6316,0.75,0.6857,0.5367,0.541
4,0.7407,0.797,0.5789,0.6471,0.6111,0.4176,0.419
5,0.7222,0.8421,0.4211,0.6667,0.5161,0.335,0.3524
6,0.8148,0.803,0.6842,0.7647,0.7222,0.584,0.586
7,0.8113,0.8317,0.6111,0.7857,0.6875,0.5554,0.5644
8,0.7358,0.7349,0.4444,0.6667,0.5333,0.3592,0.3736
9,0.7736,0.8413,0.5,0.75,0.6,0.4508,0.4688
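
`create_model` scores the model separately on each cross-validation fold. Averaging the fold values gives one summary number per metric. The sketch below does this for Accuracy and Recall with the values copied from the table above; it is plain Python rather than a PyCaret call.

```python
from statistics import mean

# Per-fold values copied from the logistic regression table above (folds 0 to 9).
accuracy = [0.7778, 0.7593, 0.7593, 0.7963, 0.7407,
            0.7222, 0.8148, 0.8113, 0.7358, 0.7736]
recall = [0.5789, 0.7368, 0.6316, 0.6316, 0.5789,
          0.4211, 0.6842, 0.6111, 0.4444, 0.5000]

mean_accuracy = mean(accuracy)
mean_recall = mean(recall)

print(f"Mean accuracy: {mean_accuracy:.4f}")
print(f"Mean recall:   {mean_recall:.4f}")
print(f"Recall range:  {min(recall):.4f} to {max(recall):.4f}")
```

These averages agree with the `lr` row that `compare_models` reports later in the notebook. The wide recall range across folds is a reminder that a single fold can be misleading on a dataset this small.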



# Evaluate Model

In [None]:
evaluate_model(mymodel)


# Compare Models: In this section, we train multiple models at once

In [None]:
# compare models
best = compare_models(include = ['lr', 'dt', 'lightgbm'], sort="Recall")

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.7597,0.8163,0.6137,0.6722,0.6326,0.4566,0.4652,0.223
lr,Logistic Regression,0.7691,0.8309,0.5819,0.7067,0.6335,0.4683,0.4761,0.062
dt,Decision Tree Classifier,0.6649,0.6341,0.5339,0.5213,0.5238,0.2666,0.2684,0.036
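
The `sort` argument only changes which metric ranks the models, so which model counts as "best" depends on it. The sketch below illustrates this with the three rows above. The values are copied from the table, and the dictionary layout and the `best_by` helper are hypothetical names used only for this example.

```python
# Metric values copied from the compare_models table above.
results = {
    "lightgbm": {"Accuracy": 0.7597, "AUC": 0.8163, "Recall": 0.6137, "Prec.": 0.6722, "F1": 0.6326},
    "lr":       {"Accuracy": 0.7691, "AUC": 0.8309, "Recall": 0.5819, "Prec.": 0.7067, "F1": 0.6335},
    "dt":       {"Accuracy": 0.6649, "AUC": 0.6341, "Recall": 0.5339, "Prec.": 0.5213, "F1": 0.5238},
}

def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda model_id: results[model_id][metric])

for metric in ["Accuracy", "AUC", "Recall", "Prec.", "F1"]:
    print(f"{metric:8s} -> {best_by(metric)}")
```

With `sort="Recall"` the top row is LightGBM, while logistic regression leads on every other metric listed here, so the choice of metric should reflect what matters for the task (for example, missing as few diabetic patients as possible).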



In [None]:
best

In [None]:
evaluate_model(best)


In [None]:
tuned_model = tune_model(best)
#lr	Logistic Regression	0.7578	0.8078	0.5310	0.7082	0.6014	0.4338	0.4462	0.591

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6852,0.7489,0.4737,0.5625,0.5143,0.2839,0.2862
1,0.8333,0.9188,0.9474,0.6923,0.8,0.663,0.687
2,0.7222,0.806,0.5789,0.6111,0.5946,0.3836,0.3839
3,0.8148,0.8617,0.6842,0.7647,0.7222,0.584,0.586
4,0.7407,0.7805,0.4737,0.6923,0.5625,0.3874,0.4014
5,0.7222,0.815,0.4737,0.6429,0.5455,0.352,0.3605
6,0.8148,0.8286,0.6316,0.8,0.7059,0.5735,0.582
7,0.7736,0.8349,0.6111,0.6875,0.6471,0.4812,0.483
8,0.717,0.7762,0.2778,0.7143,0.4,0.2591,0.3086
9,0.717,0.8,0.4444,0.6154,0.5161,0.3234,0.332



Fitting 10 folds for each of 10 candidates, totalling 100 fits


Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
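
The message above can be checked against the two tables. The sketch below rests on two assumptions: that `best` is the LightGBM model (it ranked first by Recall in `compare_models`), and that `tune_model` compared the models on Accuracy, its default `optimize` metric, since no other metric was passed. It uses plain Python and only the printed numbers.

```python
from statistics import mean

# Per-fold accuracy of the tuned model, copied from the table above.
tuned_accuracy = [0.6852, 0.8333, 0.7222, 0.8148, 0.7407,
                  0.7222, 0.8148, 0.7736, 0.7170, 0.7170]

# Accuracy of the untuned LightGBM model, from the compare_models table.
original_accuracy = 0.7597

tuned_mean = mean(tuned_accuracy)
keep_original = tuned_mean <= original_accuracy

print(f"Tuned mean accuracy:    {tuned_mean:.4f}")
print(f"Original mean accuracy: {original_accuracy:.4f}")
print("Keep original model:", keep_original)
```

Under these assumptions the tuned candidate scores slightly lower than the original, which is consistent with PyCaret returning the original model.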


In [None]:
evaluate_model(tuned_model)


# Your Turn:

- Grab a piece of paper or open a text file on your computer. Based on your understanding of the lecture, write down the high-level steps to train a model using PyCaret. Ask any questions you may have.
- Run the command **models()** to list the algorithms available for training, such as decision tree and random forest.
- Using the code explained in the cells above, use PyCaret to train two models, a decision tree and a random forest, and compare their confusion matrices. Which model performs better, and why?

You do not need to submit your answer. This is a warm-up to prepare you for your assignment!
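
When you compare confusion matrices, it helps to know how the metrics in PyCaret's tables are derived from the four cells. The sketch below uses assumed counts for a single validation fold. They were chosen so that the results match fold 0 of the logistic regression table; PyCaret did not print these counts, so treat them as an illustration.

```python
# Assumed confusion-matrix counts for one fold of 54 patients (illustrative only).
tp, fn = 11, 8    # actual positives: predicted positive / predicted negative
fp, tn = 4, 31    # actual negatives: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are real
recall = tp / (tp + fn)                      # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

A model can have good accuracy and still miss many positive cases (low recall), so read the confusion matrix cell by cell rather than relying on one number.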

In [None]:
#Your code here


# Now train a separate random forest model. What are the precision, recall, and F1 score?



In [None]:
#Your code here

# Tune model
A natural question is how to improve the model further.

In machine learning, one way to improve a model's performance (that is, reduce its prediction error) is to optimize (tune) the model-specific parameters. This is called hyperparameter tuning, and it can be a fairly complicated process.

Fortunately, PyCaret provides a simple way to automate the tuning of the model you choose.

Below, train a new random forest model and call it ***rf***. Then use **tune_model(rf)** to optimize it. Please be patient: optimizing the model may take a few minutes. Compare the results of rf and the tuned rf. What is the difference, and how do you interpret it?
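
To give an intuition for what automated tuning does, here is a toy random search in plain Python. It is a sketch under stated assumptions, not PyCaret's implementation: the search space is hypothetical, and `toy_cv_score` is a made-up stand-in that returns a formula-based score instead of fitting a real model. It mirrors the log line "Fitting 10 folds for each of 10 candidates, totalling 100 fits" seen earlier: 10 sampled candidates, each scored on 10 folds.

```python
import random

# Hypothetical search space for a random forest (illustrative values only).
search_space = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [2, 4, 6, 8, 10],
    "min_samples_leaf": [1, 2, 4],
}

fit_counter = {"fits": 0}   # counts how many (candidate, fold) fits were simulated

def toy_cv_score(params, n_folds=10):
    """Stand-in for cross-validation: returns a made-up score, not a real metric."""
    fold_scores = []
    for _ in range(n_folds):
        # A real run would fit the model on nine folds and score it on the tenth.
        fit_counter["fits"] += 1
        score = 0.70 + 0.01 * params["max_depth"] - 0.005 * params["min_samples_leaf"]
        score -= 0.0001 * abs(params["n_estimators"] - 200)
        fold_scores.append(score)
    return sum(fold_scores) / n_folds

rng = random.Random(42)     # fixed seed so the sampled candidates are repeatable
n_candidates = 10

# Random search: sample a few combinations instead of trying every one.
candidates = [
    {name: rng.choice(values) for name, values in search_space.items()}
    for _ in range(n_candidates)
]
scored = [(toy_cv_score(params), params) for params in candidates]
best_score, best_params = max(scored, key=lambda item: item[0])

total_fits = fit_counter["fits"]
print("Total fits:", total_fits)
print("Best candidate:", best_params, "score:", round(best_score, 4))
```

Because only a sample of combinations is tried, the tuned model is not guaranteed to beat the original one, which is worth keeping in mind when you compare rf with the tuned rf.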