<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode_vertical.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

# **Investigation of diabetes patients readmission among US hospitals**

# Lab 5 Data Analysis with Python

Estimated time needed: **45** minutes

This lab is dedicated to the study of machine learning classification methods. The goal is to predict whether the patient will be readmitted or not.

## Objectives

* Load a DataSet from *.csv files
* Conduct basic data analysis
* Create new columns and change column types
* Divide the DataSet into training and test sets
* Use different machine learning classification methods
* Combine classifiers into an ensemble
* Calculate accuracy and analyze errors
* Combine all stages of data analysis with Pipeline

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li>Materials and methods
            <ul>
                <li>Prerequisites</li>
            </ul>
        </li>
        <li>Import Libraries</li>
        <li>Load the Dataset</li>
        <li>Data pre-preparation</li>
        <li>Pipeline Classification
             <ul>
                <li>RandomForestClassifier</li>
                 <li>Cross-validation</li>
                 <li>Accuracy</li>
            </ul>
        </li>
         <li>Over-sampling problem</li>
        <li>Ensemble of classifiers
            <ul>
                <li>Question 1</li>
            </ul>
        </li>
        <li>Conclusions</li>
        <li>Authors</li>
    </ol>
</div>

## Materials and methods

In this lab, we will learn how to load and pre-process data, classify it, and combine classifiers into an ensemble.
This lab consists of the following steps:
* Load data - load and display data from a file
* Preliminary data preparation - preliminary analysis of the data structure, changes to column types and tables
* Pipeline classification - classification and analysis by grouping stages
    * Random forest - classification and analysis of accuracy and errors using a random forest classifier
    * Over-sampling problem - solve the problem of uneven distribution of data
    * Ensemble of classifiers - study various classifiers and methods of combining them into an ensemble

The data that we are going to use is a subset of an open-source DataSet on diabetes in US hospitals: https://www.kaggle.com/datasets/brandao/diabetes.
> This dataset is publicly available for research.
Please include this citation if you plan to use this database:
The DataSet represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks.

## Prerequisites
* [Python](https://www.python.org) - intermediate level
* [Pandas](https://pandas.pydata.org) - intermediate level
* [Matplotlib](https://matplotlib.org) - basic level
* [SeaBorn](https://seaborn.pydata.org) - basic level
* [Scikit-Learn](https://scikit-learn.org/stable/) - intermediate level

## Import Libraries/Define Auxiliary Functions

Some libraries should be imported before you can begin.

In [ ]:
!pip install imbalanced-learn

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler                         
from sklearn.compose import make_column_transformer
from sklearn import set_config
from sklearn.model_selection import train_test_split
from imblearn.pipeline import make_pipeline
from sklearn.metrics import ConfusionMatrixDisplay
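try:
    from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2
except ImportError:
    # fall back to the replacement API, which accepts the same (estimator, X, y, cmap=...) arguments
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator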
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import tree
from sklearn.metrics import recall_score, precision_score
import time
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

Let's disable warnings using **[warnings.filterwarnings()](https://docs.python.org/3/library/warnings.html)**.

In [ ]:
import warnings
warnings.filterwarnings('ignore')

The next step is to load the data file from the repository using **[read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)**.

We will use the same DataSet as in the previous lab, so the next few steps will be the same.

In [ ]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX05SZEN/clean_df.csv', index_col=0)

Now let's look at our DataSet.

In [ ]:
df

## Data pre-preparation

Let's study the DataSet. As you can see, it consists of 101388 rows × 55 columns and contains information of different types. We should make sure that pandas has recognized the data types correctly. To do this, we can use **[pandas.DataFrame.info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html?highlight=info#pandas.DataFrame.info)**.

In [ ]:
df.info()

<details>
<summary><b>Click to see attribute information</b></summary>

1. `Encounter Id` - Unique identifier of an encounter  (int64)
2. `Patient Number` - Unique identifier of a patient (int64)
3. `Race` - (categorical: `Caucasian` `AfricanAmerican` `Other` `Asian` `Hispanic`)
4. `Gender` - (categorical: `Female` `Male` `Unknown/Invalid`)
5. `Age` -  Grouped in 10-year intervals (categorical: `[0-10)` `[10-20)` `[20-30)` `[30-40)` `[40-50)` `[50-60)` `[60-70)` `[70-80)` `[80-90)` `[90-100)`)
6. `Weight` -  Weight in pounds (categorical: `[75-100)` `[50-75)` `[0-25)` `[100-125)` `[25-50)` `[125-150)` `[175-200)` `[150-175)` `>200`)
7. `Admission Type Id` - Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available (int64)
8. `Discharge Disposition Id` - Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available (int64)
9. `Admission Source Id` - Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital (int64)
10. `Time In Hospital` - Integer number of days between admission and discharge (int64)
11. `Payer Code` - Integer identifier corresponding to 23 distinct values, for example, Blue Cross\Blue Shield, Medicare, and self-pay (categorical)
12. `Medical Specialty` - Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family\general practice, and surgeon (categorical)
13. `Num Lab Procedures` - Number of lab tests performed during the encounter (float64)
14. `Num Procedures` -  Number of procedures (other than lab tests) performed during the encounter (int64)
15. `Num Medications` - Number of distinct generic names administered during the encounter (int64)
16. `Number Outpatient` - Number of outpatient visits of the patient in the year preceding the encounter (int64)
17. `Number Emergency` - Number of emergency visits of the patient in the year preceding the encounter (int64)
18. `Number Inpatient` - Number of inpatient visits of the patient in the year preceding the encounter (int64)
19. `Diagnosis1` - The primary diagnosis (coded as first three digits of ICD9) (categorical)
20. `Diagnosis2` - Secondary diagnosis (coded as first three digits of ICD9) (categorical)
21. `Diagnosis3` - Additional secondary diagnosis (coded as first three digits of ICD9) (categorical)
22. `Number Diagnoses` - Number of diagnoses entered to the system (float64)
23. `Max Glu Serum` - Indicates the range of the result or if the test was not taken. Values: `>200`, `>300`, `normal`, and `none` if not measured (categorical)
24. `A1c Result` - Indicates the range of the result or if the test was not taken. Values: `>8` if the result was greater than 8%, `>7` if the result was greater than 7% but less than 8%, `normal` if the result was less than 7%, and `none` if not measured (categorical)
25. `Metformin` - patient medications (categorical)
26. `Repaglinide` - patient medications (categorical)
27. `Nateglinide` - patient medications (categorical)
28. `Chlorpropamide` - patient medications (categorical)
29. `Glimepiride` - patient medications (categorical)
30. `Acetohexamide` - patient medications (categorical)
31. `Glipizide` - patient medications (categorical)
32. `Glyburide` - patient medications (categorical)
33. `Tolbutamide` - patient medications (categorical)
34. `Pioglitazone` - patient medications (categorical)
35. `Acarbose` - patient medications (categorical)
36. `Miglitol` - patient medications (categorical)
37. `Troglitazone` - patient medications (categorical)
38. `Tolazamide` - patient medications (categorical)
39. `Examide` - patient medications (categorical)
40. `Citoglipton` - patient medications (categorical)
41. `Insulin` - patient medications (categorical)
42. `Glyburide-metformin` - patient medications (categorical)
43. `Glipizide-metformin` - patient medications (categorical)
44. `Glimepiride-pioglitazone` - patient medications (categorical)
45. `Metformin-rosiglitazone` - patient medications (categorical)
46. `Metformin-pioglitazone` - patient medications (categorical)
47. `Diabetes Medication` -  Indicates if there was any diabetic medication prescribed. Values: `True` and `False` (bool)
48. `Readmitted` [Target Column] - Days to inpatient readmission. Values: `<30` if the patient was readmitted in less than 30 days, `>30` if the patient was readmitted in more than 30 days, and `No` for no record of readmission (categorical)
49. `ages-binned`(categorical)
50. `change_yes` - columns created in previous labs (int64)
51. `change_no` - columns created in previous labs (int64)
52. `Increased` - columns created in previous labs (int64)
53. `No` - columns created in previous labs (int64)
54. `Steady` - columns created in previous labs (int64)
55. `Decreased` - columns created in previous labs (int64)
    
    </details>

Let's study the information about the DataSet columns.

Many columns hold a limited set of values but have type `object`, so for correct analysis we change their type to categorical.

In [ ]:
obj_cols = df.select_dtypes(include='object').columns

df[obj_cols] = df[obj_cols].astype('category')
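
The conversion above can be tried on a small toy frame first. The sketch below is only an illustration: `toy` and its values are made up and are not the lab's data, and the columns are created with an explicit `object` dtype so that `select_dtypes(include='object')` finds them regardless of the pandas version.

```python
import pandas as pd

# Toy frame standing in for the lab's df (values are illustrative only)
toy = pd.DataFrame({
    "Gender": pd.Series(["Female", "Male", "Female", "Male"], dtype="object"),
    "Insulin": pd.Series(["No", "Steady", "Up", "No"], dtype="object"),
    "Time In Hospital": [3, 5, 2, 8],
})

# Same two steps as in the lab: find object columns, convert them to category
obj_cols_toy = toy.select_dtypes(include="object").columns
toy[obj_cols_toy] = toy[obj_cols_toy].astype("category")

print(toy.dtypes)
# Each categorical column now carries an explicit list of allowed values
print(toy["Insulin"].cat.categories.tolist())
```

Only the text columns change type; the integer column is left untouched.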

Now let's delete the columns that have no impact on our model, as we did in the previous lab, using **[pandas.DataFrame.drop()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)**.

In [ ]:
df.drop(['Encounter Id', 'Patient Number', 'Payer Code','Examide','Citoglipton'], inplace=True, axis=1)

The size of the resulting DataSet can be checked with **[pandas.DataFrame.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html?highlight=shape#pandas.DataFrame.shape)**:

In [ ]:
df.shape

In [ ]:
df.head()

## Pipeline Classification

### RandomForestClassifier

Before classification, the DataSet must be divided into input features and the target column.

In [ ]:
x = df.drop(columns = ['Readmitted'])

In [ ]:
y = df['Readmitted']

In [ ]:
x.info()

You can see that the input DataSet consists of 49 columns.
Of these, 31 columns are categorical, 15 are integer, 1 is boolean and 2 are float. Before classification, all integer, boolean and float fields must be normalized, and categorical fields must be encoded as numbers. This can be automated using **[sklearn.preprocessing.OrdinalEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)** and **[sklearn.preprocessing.StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)**.

Since the machine learning process consists of several steps, each of which exposes methods such as `fit` and `predict`, we can combine all these stages into one block using a `Pipeline` (**[sklearn.pipeline.make_pipeline()](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)**) together with **[sklearn.compose.make_column_transformer()](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)**, and visualize it with **[sklearn.set_config()](https://scikit-learn.org/stable/modules/generated/sklearn.set_config.html)**. In this lab `make_pipeline` is imported from `imblearn.pipeline`, so that a sampler can be added to the pipeline later.

Select all categorical columns

In [ ]:
cat_col = x.select_dtypes(include=['category']).columns

Here select all numerical, boolean and float columns

In [ ]:
numeric_col = x.select_dtypes(include=['int64','float','boolean']).columns

Now create a transformer for the previously selected columns.

In [ ]:
trans = make_column_transformer((OrdinalEncoder(handle_unknown = 'use_encoded_value',unknown_value = -1),cat_col),
                                (StandardScaler(),numeric_col),
                                remainder = 'passthrough')
set_config(display = 'diagram')
trans
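
To see what such a transformer produces, here is a minimal sketch on a made-up two-column frame. The names `toy`, `toy_trans` and `unseen` are hypothetical and exist only for this illustration; the encoder settings mirror the ones used above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up data: one categorical column, one numeric column
toy = pd.DataFrame({
    "Insulin": pd.Categorical(["No", "Steady", "Up", "No"]),
    "Time In Hospital": [2.0, 4.0, 6.0, 8.0],
})

toy_trans = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["Insulin"]),
    (StandardScaler(), ["Time In Hospital"]),
)

# Column 0: category codes (sorted order), column 1: zero mean, unit variance
encoded = toy_trans.fit_transform(toy)
print(encoded)

# A category never seen during fit is mapped to -1 instead of raising an error
unseen = pd.DataFrame({"Insulin": pd.Categorical(["Down"]), "Time In Hospital": [5.0]})
unseen_encoded = toy_trans.transform(unseen)
print(unseen_encoded)
```

The `unknown_value=-1` setting matters when the test set contains a category that never appears in the training set.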

Next, we must split the DataSet into training and test sets in order to estimate the accuracy of the models. To do this we can use **[sklearn.model_selection.train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**. Let's hold out 33% of the rows as the test set (`test_size = 0.33`).

In [ ]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.33, shuffle=False)

In [ ]:
x_train.shape

In [ ]:
x_test.shape
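
Note that the split above uses `shuffle=False`, so the test set is simply the last 33% of the rows. Whether the row order of the lab's file carries any meaning is not stated, so the following is only a sketch on made-up arrays (`X_toy`, `y_toy` are hypothetical) showing how an unshuffled split differs from a shuffled, stratified one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data deliberately sorted by class: 8 "NO" rows followed by 4 "<30" rows
X_toy = np.arange(12).reshape(-1, 1)
y_toy = np.array(["NO"] * 8 + ["<30"] * 4)

# shuffle=False: the test set is the tail of the data
_, X_te_seq, _, y_te_seq = train_test_split(
    X_toy, y_toy, test_size=0.33, shuffle=False
)
print(X_te_seq.ravel(), y_te_seq)

# Alternative: shuffle and stratify so both parts keep the class proportions
_, X_te_str, _, y_te_str = train_test_split(
    X_toy, y_toy, test_size=0.33, stratify=y_toy, random_state=0
)
print(y_te_str)
```

On sorted data the unshuffled test set contains a single class, while the stratified one contains both.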

Now let's create a RandomForestClassifier model (**[sklearn.ensemble.RandomForestClassifier()](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**) and add it to our `Pipeline`.

In [ ]:
rfc = RandomForestClassifier()
pipe_rfc = make_pipeline(trans,rfc)

Let's fit our model and calculate its accuracy.

In [ ]:
pipe_rfc.fit(x_train,y_train)

### Cross-validation

Cross-validation is a technique in machine learning where the available DataSet is split into multiple subsets or folds, and the model is trained and tested on different subsets in a rotation. The primary purpose of cross-validation is to estimate how well the model is expected to perform when it is deployed to make predictions on new, unseen data.

One common way to implement cross-validation is by using the cross_val_score helper function, which takes an estimator (the model to be trained and tested) and the DataSet, and returns the scores from each fold. This allows for easy evaluation and comparison of different models based on their performance metrics.

In [ ]:
Rcross = cross_val_score(pipe_rfc,x ,y, cv = 4)
print(Rcross)
print("The mean of the folds is", Rcross.mean(), "and the standard deviation is", Rcross.std())

In [ ]:
yhat = cross_val_predict(pipe_rfc,x,y,cv = 4)
yhat[0:5]
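
The same two helpers can be tried quickly on synthetic data. This is a minimal sketch, not the lab's model: the data come from `make_classification`, and `pipe_syn`, `fold_scores` and `oof_pred` are names introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem (assumption: stands in for real data)
X_syn, y_syn = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
pipe_syn = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0)
)

# One accuracy score per fold; the whole pipeline is refitted inside each fold
fold_scores = cross_val_score(pipe_syn, X_syn, y_syn, cv=4)
print(fold_scores, fold_scores.mean(), fold_scores.std())

# One prediction per row, each made by a model that did not see that row
oof_pred = cross_val_predict(pipe_syn, X_syn, y_syn, cv=4)
print((oof_pred == y_syn).mean())
```

Passing the whole pipeline, rather than pre-transformed data, keeps the scaler from seeing the held-out fold during fitting.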

### Accuracy

In [ ]:
scores_train = pipe_rfc.score(x_train, y_train)
scores_test = pipe_rfc.score(x_test, y_test)
print('Training DataSet accuracy: {: .1%}'.format(scores_train), 'Test DataSet accuracy: {: .1%}'.format(scores_test))

Because a random forest classifier is randomized, the accuracy can change a little between runs, so you can rerun the cells above to try for a better result.

Let's evaluate the correctness of the classification with a confusion matrix, drawn with **[sklearn.metrics.ConfusionMatrixDisplay.from_estimator()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html)**, and verify these conclusions.

In [ ]:
ConfusionMatrixDisplay.from_estimator(pipe_rfc, x_test, y_test, cmap=plt.cm.Blues)
plt.show()

As you can see, for test accuracy, we get ~57%.

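The `Recall` metric measures what fraction of the samples that truly belong to a class the model identifies correctly: **[sklearn.metrics.recall_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)**. Note that for a single-label multiclass problem, recall computed with `average='micro'` is equal to accuracy.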

In [ ]:
scores_train = recall_score(y_train, pipe_rfc.predict(x_train), average='micro')
scores_test = recall_score(y_test, pipe_rfc.predict(x_test), average='micro')
print('Training DataSet accuracy: {: .1%}'.format(scores_train), 'Test DataSet accuracy: {: .1%}'.format(scores_test))
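
The choice of the `average` argument changes what the number means. The sketch below uses made-up labels (`y_true` and `y_pred` are not the lab's results) to compare micro-averaged recall, macro-averaged recall and per-class recall.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: the model predicts the majority class "NO" almost everywhere
y_true = ["NO", "NO", "NO", "NO", ">30", ">30", "<30", "<30"]
y_pred = ["NO", "NO", "NO", "NO", "NO", ">30", "NO", "NO"]

acc = accuracy_score(y_true, y_pred)
micro = recall_score(y_true, y_pred, average="micro")
macro = recall_score(y_true, y_pred, average="macro")
per_class = recall_score(y_true, y_pred, average=None, labels=["<30", ">30", "NO"])

print("accuracy:", acc)          # 5 of 8 predictions are correct
print("micro recall:", micro)    # same value as accuracy
print("macro recall:", macro)    # unweighted mean of the per-class recalls
print("per-class recall:", per_class)
```

Per-class or macro-averaged recall exposes the weak minority classes that the micro average hides.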

As this metric shows, the score is low. If the training and test scores are approximately the same, the model is not simply memorizing the training data, and improving the score requires more training examples for the poorly predicted classes. Let's analyze the target distribution.

### Over-sampling problem

Let's analyze readmission (**[seaborn.countplot()](https://seaborn.pydata.org/generated/seaborn.countplot.html)**):

In [ ]:
sns.countplot(x = y)

As you can see, the classes are imbalanced: some readmission outcomes occur much more often than others. To balance the DataSet, we can use a special class: **[imblearn.over_sampling.RandomOverSampler()](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html)**:

In [ ]:
ROS = RandomOverSampler()
pipe_ros = make_pipeline(trans,ROS)
o_x, o_y = pipe_ros.fit_resample(x_test,y_test)
sns.countplot(x = o_y)
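
Conceptually, random over-sampling duplicates randomly chosen rows of the smaller classes until every class is as large as the biggest one. The following NumPy sketch illustrates that idea on made-up labels; it is not the imbalanced-learn implementation, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up imbalanced target: 6 x "NO", 3 x ">30", 1 x "<30"
y_imb = np.array(["NO"] * 6 + [">30"] * 3 + ["<30"] * 1)
X_imb = np.arange(len(y_imb)).reshape(-1, 1)

classes, counts = np.unique(y_imb, return_counts=True)
target = counts.max()  # every class is grown to the size of the largest one

keep = [np.arange(len(y_imb))]  # start with all original rows
for cls, n in zip(classes, counts):
    idx = np.flatnonzero(y_imb == cls)
    # draw extra row indices with replacement until the class reaches `target`
    keep.append(rng.choice(idx, size=target - n, replace=True))
keep = np.concatenate(keep)

X_bal, y_bal = X_imb[keep], y_imb[keep]
balanced_counts = dict(zip(*np.unique(y_bal, return_counts=True)))
print(balanced_counts)
```

Over-sampling should affect only the data used for fitting. When a sampler is a step of an `imblearn` pipeline, it is applied during `fit` and skipped during `predict`, so the test data are scored as they are.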

Let's add this sampler to our `Pipeline`, fit the model and recalculate the accuracy.

In [ ]:
pipe_s_rfc = make_pipeline(trans, ROS, rfc)
pipe_s_rfc

In [ ]:
pipe_s_rfc.fit(x_train,y_train)
scores_train = recall_score(y_train, pipe_s_rfc.predict(x_train), average = 'weighted')
scores_test = recall_score(y_test, pipe_s_rfc.predict(x_test), average = 'weighted')
print('Training DataSet accuracy: {: .1%}'.format(scores_train), 'Test DataSet accuracy: {: .1%}'.format(scores_test))

As you can see, balancing the DataSet has led to a decrease in the `Recall` metric.

Let's analyze the errors of the model.

In [ ]:
ConfusionMatrixDisplay.from_estimator(pipe_s_rfc, x_test, y_test, cmap=plt.cm.Blues)
plt.show()

As you can see, some values slightly increased. However, the error is still high when the model predicts a patient readmission. The `Precision` metric is used to assess this kind of error.
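
Both metrics can be read directly off a confusion matrix. The sketch below uses made-up counts (the matrix `cm` is hypothetical, not the lab's output) with rows as true classes and columns as predicted classes, which is the layout scikit-learn uses.

```python
import numpy as np

labels = ["<30", ">30", "NO"]
# rows = true class, columns = predicted class (made-up counts)
cm = np.array([
    [2,  3,  5],
    [1, 10,  9],
    [2,  7, 41],
])

# Recall: correct predictions divided by the number of true samples (row sums)
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.2f} precision={p:.2f}")
```

A low precision for a readmission class means that many of the patients predicted to be readmitted actually were not.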

### Ensemble of classifiers

Let's test other classifiers and compare the results.
We will test:

* [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegression)
* [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreeclassifier#sklearn.tree.DecisionTreeClassifier)
* [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforestclassifier#sklearn.ensemble.RandomForestClassifier)
* [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
* [Ada Boost Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html?highlight=adaboostclassifier#sklearn.ensemble.AdaBoostClassifier)
* [Extra Trees Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
* [Gradient Boosting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

In addition, different classifiers may err in different situations. Therefore, to compensate for each other's mistakes, we can combine the models into an ensemble with a Voting Classifier.

A **[Voting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)** is a machine learning model that trains an ensemble of several models and predicts the output class by combining their predictions.
It aggregates the findings of each classifier passed into it and predicts the class that wins the vote. The idea is that instead of creating separate dedicated models and finding the accuracy of each of them, we create a single model that is trained on top of these models and predicts the output based on their combined vote for each output class.

Voting Classifier supports two types of voting.

**Hard Voting**: In hard voting, the predicted output class is the class with the most votes, i.e. the class predicted by the largest number of classifiers. Suppose three classifiers predicted the output classes (A, A, B); the majority predicted A, so A will be the final prediction.

**Soft Voting**: In soft voting, the output class is the one with the highest average predicted probability. Suppose that, for some input, three models give class A the probabilities (0.30, 0.47, 0.53) and class B the probabilities (0.20, 0.32, 0.40). The average for class A is 0.4333 and for B is 0.3067, so the winner is class A because it has the highest probability averaged over the classifiers.
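
The two worked examples above can be reproduced in a few lines. This is only an arithmetic sketch of the voting rules using the numbers from the text, not a use of `VotingClassifier` itself.

```python
from collections import Counter

import numpy as np

# Hard voting: each classifier casts one vote, the most common class wins
hard_votes = ["A", "A", "B"]
hard_winner = Counter(hard_votes).most_common(1)[0][0]
print("hard voting winner:", hard_winner)

# Soft voting: average the predicted probabilities per class, take the largest
class_names = ["A", "B"]
proba = np.array([
    [0.30, 0.20],  # classifier 1: P(A), P(B)
    [0.47, 0.32],  # classifier 2
    [0.53, 0.40],  # classifier 3
])
avg = proba.mean(axis=0)
soft_winner = class_names[int(avg.argmax())]
print("average probabilities:", avg.round(4))
print("soft voting winner:", soft_winner)
```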


In [ ]:
clf_s = make_pipeline(trans, ROS)
names = ["Logistic Regression",
         "Decision Tree", "Random Forest","Gaussian Naive Bayes"]
classifiers = [
    LogisticRegression(),
    DecisionTreeClassifier(max_depth=8),
    RandomForestClassifier(n_estimators=10, max_features=1),
    GaussianNB(),
    ]

scores_train = []
scores_test = []
scores_train_s = []
scores_test_s = []


You can also use other classifiers, such as Extra Trees Classifier or Gradient Boosting Classifier, but they take significantly more time to fit.

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  1: </h1>

<b>Try out other classification models.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>
    
```python

names += ["Ada Boost Classifier","Extra Trees Classifier","Gradient Boosting Classifier"]
classifiers += [
    AdaBoostClassifier(),
    ExtraTreesClassifier(),
    GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0),
    ]
```
    
</details>

Run all classifiers.

In [ ]:
est = [(str(est), est) for est in classifiers]
eclf = [VotingClassifier(
     estimators=est,
     voting='hard')]
names += ["Voting Classifier"]
classifiers += eclf
for name, classif in zip(names, classifiers):
    start_time = time.time()
    print(name,'fitting.....',end = '')
    clf = make_pipeline(trans, classif)
    clf.fit(x_train,y_train)
    score_train = recall_score(y_train, clf.predict(x_train), average='micro')
    score_test = recall_score(y_test, clf.predict(x_test), average='micro')
    scores_train.append(score_train)
    scores_test.append(score_test)
    
    clf_s = make_pipeline(trans, ROS, classif)
    clf_s.fit(x_train,y_train)
    score_train_s = recall_score(y_train, clf_s.predict(x_train), average='micro')
    score_test_s = recall_score(y_test, clf_s.predict(x_test), average='micro')
    scores_train_s.append(score_train_s)
    scores_test_s.append(score_test_s)
    end_time = time.time()
    print(" [",round(end_time - start_time,2),"s]")
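
For a smaller experiment with the ensemble alone, the sketch below fits a `VotingClassifier` with both voting modes on synthetic data. The data, the choice of member models and the short estimator names (`logreg`, `tree`, `nb`) are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for a real problem
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=0
)

# Each member gets an explicit, short name
members = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ("nb", GaussianNB()),
]

# Soft voting needs predict_proba, which all three members provide
for voting in ("hard", "soft"):
    vote = VotingClassifier(estimators=members, voting=voting).fit(X_tr, y_tr)
    print(voting, round(vote.score(X_te, y_te), 3))
```

Explicit names make the fitted members easy to look up afterwards through `vote.named_estimators_`.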

Let's compare the accuracy of classifiers for balanced and unbalanced data sets.

In [ ]:
res = pd.DataFrame(index = names)
res['Train'] = np.array(scores_train)
res['Test'] = np.array(scores_test)
res['Train Over Sampler'] = np.array(scores_train_s)
res['Test Over Sampler'] = np.array(scores_test_s)

res.index.name = "Classifier accuracy"
pd.options.display.float_format = '{:,.2f}'.format
res

Here is a bar chart of the table above.

In [ ]:
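fig, ax = plt.subplots(figsize=(10,6))
pos = np.arange(len(names))   # one group of bars per classifier
width = 0.4
# draw the two series side by side so that neither hides the other
ax.bar(pos - width/2, scores_test, width)
ax.bar(pos + width/2, scores_test_s, width)
ax.legend(['Test', 'Test Over Sampler'])

ax.set_title('Classifiers Accuracy')
ax.set_xlabel('Classifier')
ax.set_ylabel('Accuracy')

ax.set_xticks(pos)
ax.set_xticklabels(names, rotation=45)
plt.show()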

As you can see, for our DataSet, balancing only decreased accuracy. It can also be seen that the most accurate models were logistic regression, Ada Boost, Gradient Boosting and the voting classifier. The ensemble of models showed better accuracy on the training DataSet and slightly worse accuracy on the test DataSet.

Let's display the last classifier:

In [ ]:
clf_s

## Conclusions

In this lab we studied how to normalize numerical data and encode categorical data. We showed how to build training and test DataSets, how to fit different classifiers, and how to evaluate their accuracy and analyze their errors.
We also studied how to join classifiers together in an ensemble and create a model based on Pipeline.
We compared the accuracy of different classifiers and their ensemble and showed how they can be used to predict the readmission of diabetes patients.

The resulting accuracy was about 60%.

### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/dmytro_yesyp" target="_blank">Dmytro Yesyp</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
| 2023-03-25       | 01     | Dmytro Yesyp     | Lab created|
<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
