<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode_vertical.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

# **Investigation of diabetes patients readmission among US hospitals**

# Lab 4 Data Analysis with Python

Estimated time needed: **30** minutes

## Objectives

1. preprocess the data (normalize and transform categorical data) and create a DataSet
2. select features
3. classify patients
4. visualize the decision tree of a classification model

## Table of Contents


<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li>Materials and Methods</li>
        <li>Import Libraries</li>
        <li>Load the Dataset</li>
        <li>Data preparation
            <ul>
                <li>Data transformation</li>
                <li>Encoding and Normalization</li>
            </ul>
        </li>
        <li>Features selection
             <ul>
                <li>Chi-Squared Statistic</li>
                <li>Mutual Information Statistic</li>
                <li>Feature Importance</li>
                <li>Correlation Matrix with Heatmap</li>
            </ul>
        </li>
         <li>Tasks</li>
        <li>Classification models
            <ul>
                <li>Train and Test DataSets creation</li>
                <li>Extra Trees Classifier</li>
                <li>Logistic regression </li>
            </ul>
        </li>
        <li>Decision tree
            <ul>
                <li>Build model</li>
                <li>Visualization of decision tree</li>
            </ul>
        </li>
        <li>Conclusions</li>
        <li>Authors</li>
    </ol>
</div>


----

## 1. Materials and Methods

The data used in this lab is a subset of an open-source DataSet on diabetes in the US: https://www.kaggle.com/datasets/brandao/diabetes.

> This dataset is publicly available for research.
Please include this citation if you plan to use this database:
The DataSet represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks.

It is important to know whether a patient is likely to be readmitted to hospital, because the treatment can then be adjusted to avoid the readmission.

In this lesson, we will try to answer a set of questions that may be relevant when analyzing diabetes data:

1. What are the most useful Python libraries for classification analysis?
2. How do we transform categorical data?
3. How do we create a DataSet?
4. How do we select features?
5. How do we build, fit and visualize a classification model?

In addition, we will draw conclusions from the results of our classification analysis, which predicts whether a patient will be readmitted.

[Scikit-learn](https://scikit-learn.org/stable/) (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

## 2. Import Libraries

Download data using a URL.

In [ ]:
# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

Alternative URL for the dataset downloading.

In [ ]:
# !wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VDA_Banking_L2/bank-additional.zip

Unzip the archive to a folder. It is a good idea to pass the `-o` and `-q` flags when unzipping: `-o` overwrites any existing files and `-q` keeps the output quiet.


In [ ]:
# !unzip -o -q bank-additional.zip

Import the libraries needed for this lab. We add some aliases to make the libraries easier to use in our code, set a default figure size for later plots, and suppress warnings.


In [ ]:
!conda install --yes scikit-learn==0.24.2
!conda install --yes python-graphviz

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 6)
# Data transformation
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import MinMaxScaler
# Features Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, mutual_info_classif
# Classifiers
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

## 3. Load the Dataset

We will use the same DataSet as in the previous labs, so the next few steps will be the same.

In [ ]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX027CEN/clean_df.csv', index_col=0)
df.head(5)

In [ ]:
df.shape

As you can see, the DataSet consists of 56 columns and 101745 rows. The target column is "Readmitted". We investigated these columns in previous labs.

<details>
<summary><b>Click to see attribute information</b></summary>
Input features (column names):

1. `Encounter Id` - Unique identifier of an encounter  (int64)
2. `Patient Number` - Unique identifier of a patient (int64)
3. `Race` - (categorical: `Caucasian` `AfricanAmerican` `Other` `Asian` `Hispanic`)
4. `Gender` - (categorical: `Female` `Male` `Unknown/Invalid`)
5. `Age` -  Grouped in 10-year intervals (categorical: `[0-10)` `[10-20)` `[20-30)` `[30-40)` `[40-50)` `[50-60)` `[60-70)` `[70-80)` `[80-90)` `[90-100)`)
6. `Weight` -  Weight in pounds (categorical: `[75-100)` `[50-75)` `[0-25)` `[100-125)` `[25-50)` `[125-150)` `[175-200)` `[150-175)` `>200`)
7. `Admission Type Id` - Integer identifier corresponding to 9 distinct values, for example, emergency, urgent, elective, newborn, and not available (int64)
8. `Discharge Disposition Id` - Integer identifier corresponding to 29 distinct values, for example, discharged to home, expired, and not available (int64)
9. `Admission Source Id` - Integer identifier corresponding to 21 distinct values, for example, physician referral, emergency room, and transfer from a hospital (int64)
10. `Time In Hospital` - Integer number of days between admission and discharge (int64)
11. `Payer Code` - Integer identifier corresponding to 23 distinct values, for example, Blue Cross\Blue Shield, Medicare, and self-pay (categorical)
12. `Medical Specialty` - Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family\general practice, and surgeon (categorical)
13. `Num Lab Procedures` - Number of lab tests performed during the encounter (float64)
14. `Num Procedures` -  Number of procedures (other than lab tests) performed during the encounter (int64)
15. `Num Medications` - Number of distinct generic names administered during the encounter (int64)
16. `Number Outpatient` - Number of outpatient visits of the patient in the year preceding the encounter (int64)
17. `Number Emergency` - Number of emergency visits of the patient in the year preceding the encounter (int64)
18. `Number Inpatient` - Number of inpatient visits of the patient in the year preceding the encounter (int64)
19. `Diagnosis1` - The primary diagnosis (coded as first three digits of ICD9) (categorical)
20. `Diagnosis2` - Secondary diagnosis (coded as first three digits of ICD9) (categorical)
21. `Diagnosis3` - Additional secondary diagnosis (coded as first three digits of ICD9) (categorical)
22. `Number Diagnoses` - Number of diagnoses entered to the system (float64)
23. `Max Glu Serum` - Indicates the range of the result or if the test was not taken. Values: `>200`, `>300`, `normal`, and `none` if not measured (categorical)
24. `A1c Result` - Indicates the range of the result or if the test was not taken. Values: `>8` if the result was greater than 8%, `>7` if the result was greater than 7% but less than 8%, `normal` if the result was less than 7%, and `none` if not measured (categorical)
25. `Metformin` - patient medications (categorical)
26. `Repaglinide` - patient medications (categorical)
27. `Nateglinide` - patient medications (categorical)
28. `Chlorpropamide` - patient medications (categorical)
29. `Glimepiride` - patient medications (categorical)
30. `Acetohexamide` - patient medications (categorical)
31. `Glipizide` - patient medications (categorical)
32. `Glyburide` - patient medications (categorical)
33. `Tolbutamide` - patient medications (categorical)
34. `Pioglitazone` - patient medications (categorical)
35. `Acarbose` - patient medications (categorical)
36. `Miglitol` - patient medications (categorical)
37. `Troglitazone` - patient medications (categorical)
38. `Tolazamide` - patient medications (categorical)
39. `Examide` - patient medications (categorical)
40. `Citoglipton` - patient medications (categorical)
41. `Insulin` - patient medications (categorical)
42. `Glyburide-metformin` - patient medications (categorical)
43. `Glipizide-metformin` - patient medications (categorical)
44. `Glimepiride-pioglitazone` - patient medications (categorical)
45. `Metformin-rosiglitazone` - patient medications (categorical)
46. `Metformin-pioglitazone` - patient medications (categorical)
47. `Diabetes Medication` -  Indicates if there was any diabetic medication prescribed. Values: `True` and `False` (bool)
48. **`Readmitted` [Target Column]** - Days to inpatient readmission. Values: `<30` if the patient was readmitted in less than 30 days, `>30` if the patient was readmitted in more than 30 days, and `No` for no record of readmission (categorical)
49. `ages-binned`(categorical)
50. `change_yes` - columns created in previous labs (int64)
51. `change_no` - columns created in previous labs (int64)
52. `Increased` - columns created in previous labs (int64)
53. `No` - columns created in previous labs (int64)
54. `Steady` - columns created in previous labs (int64)
55. `Decreased` - columns created in previous labs (int64)

</details>

Our goal is to create a classification model that can predict whether a patient will be readmitted. To do this, we must analyze and prepare the data for this type of model.

## 4. Data preparation

### Data transformation

First of all, we should check which data types pandas has assigned to the features:

In [ ]:
df.info()

As you can see, all categorical features were recognized as `object`. We must change their type to `category`.

In [ ]:
col_cat = list(df.select_dtypes(include=['object']).columns)
col_cat

Let's convert these columns to the `category` type and check the result.

In [ ]:
df[col_cat] = df[col_cat].astype('category')
df.info()

To see the unique values of a particular feature (column) we can use:

In [ ]:
df['Race'].unique()

As noted earlier, the dataset contains 101388 objects (rows), each described by 52 features (columns), including 1 target feature (Readmitted). 37 features, including the target, are categorical. Values of this type cannot be used for classification directly; we must transform them to int or float.
To do this we can use **[LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)** and **[OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)**. These classes encode categorical features as an integer array.

First of all, we separate the DataSet into input and output (target) DataSets.
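To make the encoding step concrete, here is a minimal sketch on a hypothetical three-row table (the values are invented for illustration and are not taken from the lab data). It shows that `OrdinalEncoder` assigns codes per column in sorted category order and that the mapping can be reversed.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy table; only the column names mirror the lab DataSet.
toy = pd.DataFrame({
    "Race": ["Caucasian", "Asian", "Caucasian"],
    "Gender": ["Female", "Male", "Male"],
})

toy_encoder = OrdinalEncoder()
toy_codes = toy_encoder.fit_transform(toy)  # float array, one code per cell

# categories_ lists the sorted categories found in each column
print(toy_encoder.categories_)
print(toy_codes)

# inverse_transform maps the codes back to the original labels
toy_decoded = toy_encoder.inverse_transform(toy_codes)
```

Keep in mind that the integer codes impose an order on the categories that may not be meaningful, which matters more for linear models than for tree-based ones.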

### Excluding columns that do not affect the target column
Columns that do not affect the result can be excluded.
In our case these columns are `Encounter Id`, `Patient Number` and `Payer Code`.

In [ ]:
df.drop(['Encounter Id','Patient Number','Payer Code'], inplace=True, axis=1)

In [ ]:
# from X we have to remove target column (Readmitted).
X = df.iloc[:,df.columns != 'Readmitted']
y = df["Readmitted"]
print(X)

### Encoding and Normalization

Then create a list of categorical fields and transform their values to int arrays: (Replace ##YOUR CODE GOES HERE## with your Python code.)

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  1: </h1>

<b>Create a list of categorical fields and transform their values to int arrays.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>

```python
col_cat =df.iloc[:,df.columns != 'Readmitted'].select_dtypes(include=['category']).columns

oe = OrdinalEncoder()
oe.fit(X[col_cat])
X_cat_enc = oe.transform(X[col_cat])
```

</details>

In [ ]:
X_cat_enc

Then we must transform the arrays back into a DataFrame:

In [ ]:
X_cat_enc = pd.DataFrame(X_cat_enc)
X_cat_enc.columns = col_cat
X_cat_enc

Numerical fields can have different scales and can contain negative values. This can lead to rounding errors and exceptions in some AI methods. To avoid this, these features must be normalized.

Let's create a list of numerical fields and normalize them using **[MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)**
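As a quick illustration of what min-max scaling does, the sketch below scales a hypothetical single-column array and compares the result with the formula `(x - min) / (max - min)` computed by hand. The numbers are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values, e.g. a count-like feature on an arbitrary scale
toy_values = np.array([[10.0], [20.0], [30.0]])

# Scale to the [0, 1] range with scikit-learn
toy_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy_values)

# The same transformation written out by hand, column by column
col_min = toy_values.min(axis=0)
col_max = toy_values.max(axis=0)
manual = (toy_values - col_min) / (col_max - col_min)

print(toy_scaled.ravel())
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.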

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  2: </h1>

<b>Create a list of integer, float and boolean fields and normalize them using MinMaxScaler.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>
    
```python
col_num = X.select_dtypes(include=['int64', 'float', 'bool']).columns  # 'bool' matches NumPy boolean columns

scaler = MinMaxScaler(feature_range=(0, 1))
X_num_enc = scaler.fit_transform(X[col_num])
```

</details>

In [ ]:
X_num_enc

As in the previous case, transform the resulting array back into a DataFrame:

In [ ]:
X_num_enc = pd.DataFrame(X_num_enc)
X_num_enc.columns = col_num
X_num_enc

Then we should concatenate these DataFrames into one input DataFrame.

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  3: </h1>

<b>Concatenate these DataFrames in one input DataFrame.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary> 

```python  
x_enc = pd.concat([X_cat_enc, X_num_enc], axis=1)
x_enc
```
    
</details>

We must apply the same kind of transformation to the target field:

In [ ]:
le = LabelEncoder()
le.fit(y)
y_enc = le.transform(y)
y_enc = pd.Series(y_enc)
y_enc.name = y.name  # a Series has a name, not columns

In [ ]:
y

In [ ]:
y_enc

As you can see, the value '<30' was changed to 0, '>30' to 1 and 'NO' to 2.
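The mapping follows from the fact that `LabelEncoder` sorts the labels before numbering them. A minimal sketch on a hypothetical four-element target (the sample values are made up; only the label strings come from the lab) shows how to read the mapping from the fitted encoder's `classes_` attribute:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values using the same three labels as the lab
toy_target = ["NO", ">30", "<30", "NO"]

toy_le = LabelEncoder().fit(toy_target)

# classes_ holds the sorted labels; a label's position is its integer code
label_to_code = {str(label): code for code, label in enumerate(toy_le.classes_)}
print(label_to_code)

toy_encoded = toy_le.transform(toy_target)
print(toy_encoded)
```

Keeping the fitted encoder around is useful later, for example to turn predicted codes back into labels with `inverse_transform`.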

## 5. Features selection

As noted above, the input DataFrame contains dozens of features. Of course, some of them are more significant for classification than others.

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable.

They are:

* Chi-Squared Statistic.
* Mutual Information Statistic.

Let’s take a closer look at each in turn.

To do this we can use **[SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)**

### Chi-Squared Statistic

Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

You can learn more about this statistical test in the tutorial:

[A Gentle Introduction to the Chi-Squared Test for Machine Learning](https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)
The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset.

The scikit-learn machine learning library provides an implementation of the chi-squared test in the **[chi2()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2)** function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.

For example, we can define the SelectKBest class to use the chi2() function and select all (or most significant) features, then transform the train and test sets.

Apply the SelectKBest class to extract the 5 best features:

In [ ]:
bestfeatures = SelectKBest(score_func=chi2, k=5)
fit = bestfeatures.fit(x_enc,y_enc)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)  # scores follow the column order of x_enc, not X

Concatenate the two DataFrames for better visualization:

In [ ]:
featureScores = pd.concat([dfcolumns, dfscores],axis=1).dropna()
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(round(featureScores.nlargest(5,'Score'),2))  #print 5 best features
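As an alternative way of reading the result, the selected column names can be taken directly from the fitted selector with `get_support()`. The sketch below uses synthetic data (all names and values are hypothetical): one non-negative feature is derived from the target and two are random noise, so the chi-squared score should single out the first one.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Hypothetical binary target and three non-negative features
toy_y = rng.integers(0, 2, size=200)
toy_X = pd.DataFrame({
    "signal": toy_y * 3,                       # fully determined by the target
    "noise_a": rng.integers(0, 4, size=200),   # unrelated to the target
    "noise_b": rng.integers(0, 4, size=200),   # unrelated to the target
})

toy_selector = SelectKBest(score_func=chi2, k=1).fit(toy_X, toy_y)

# Index the scores by the same columns that were passed to fit()
toy_scores = pd.Series(toy_selector.scores_, index=toy_X.columns)
print(toy_scores.sort_values(ascending=False))

# get_support() returns a boolean mask over the input columns
toy_selected = list(toy_X.columns[toy_selector.get_support()])
print(toy_selected)
```

Indexing the scores by the columns of the frame passed to `fit()` avoids mismatches between scores and feature names.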

### Mutual Information Statistic

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

[You can learn more about mutual information in the following tutorial.](https://machinelearningmastery.com/information-gain-and-mutual-information)

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the **[mutual_info_classif()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif)** function.

Like chi2(), it can be used in the SelectKBest feature selection strategy (and other strategies).
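To get a feel for the quantity itself, here is a small worked sketch on hypothetical discrete variables. It uses `sklearn.metrics.mutual_info_score`, which computes mutual information (in nats) between two label vectors; this is a different function from `mutual_info_classif` and is used here only because its result is easy to check by hand.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Hypothetical balanced binary variable
a = [0, 0, 1, 1]

# An identical copy: knowing one value removes all uncertainty about the other
mi_identical = mutual_info_score(a, a)

# A variable arranged to be independent of `a`: knowing it tells us nothing
b = [0, 1, 0, 1]
mi_independent = mutual_info_score(a, b)

print(mi_identical, np.log(2))  # equals the entropy of a fair coin, ln 2
print(mi_independent)           # zero
```

The higher the mutual information between a feature and the target, the more that feature reduces uncertainty about the target.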

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  4: </h1>

<b>Select the 5 best features using SelectKBest with the mutual information statistic.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>

```python
bestfeatures = SelectKBest(score_func=mutual_info_classif, k=5)
fit = bestfeatures.fit(x_enc,y_enc)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)  # scores follow the column order of x_enc, not X
featureScores = pd.concat([dfcolumns, dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(round(featureScores.nlargest(5,'Score'),2))  #print 5 best features
```
    
</details>

As you can see, these two functions select different sets of significant features.

### Feature Importance

You can get the importance of each feature of your DataFrame from the feature importance attribute of a fitted classification model.
Feature importance gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.
For example:
Feature importance is a built-in attribute of **[Tree Based Classifiers](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)**. We will use the Extra Trees Classifier to extract the top 5 features of the dataset.

Let's create and fit the model:

In [ ]:
model = ExtraTreesClassifier()
model.fit(x_enc,y_enc)

Use the built-in `feature_importances_` attribute of tree-based classifiers:

In [ ]:
print(model.feature_importances_)

Let's transform it into Series and plot graph of feature importances for better visualization

In [ ]:
feat_importances = pd.Series(model.feature_importances_, index=x_enc.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.show()

You can see that for the Extra Trees Classifier the importance of features differs from the previous cases. This means that there are no exact rules for feature selection: feature importance strongly depends on the model.

### Correlation Matrix with Heatmap

Correlation describes how the features are related to each other.
Correlation can be positive (an increase in the value of one feature increases the value of the other) or negative (an increase in the value of one feature decreases the value of the other).
A heatmap makes it easy to identify which features are most related to one another. We will plot a heatmap of the correlated features using the seaborn library.

In [ ]:
corrmat = x_enc.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,10))
g=sns.heatmap(x_enc[top_corr_features].corr(),annot=False,cmap="RdYlGn")

We can notice that the "Examide" and "Citoglipton" columns show no correlation values on our heatmap. So let's examine these columns.

In [ ]:
x_enc[['Examide', 'Citoglipton']]

Now examine each column closely.

In [ ]:
x_enc['Examide'].value_counts()

In [ ]:
x_enc['Citoglipton'].value_counts()

As we can see, these columns have a constant value of 0.0. Correlation is undefined for a constant column, and such a column cannot affect our model, so the best decision is to remove the "Examide" and "Citoglipton" columns.

In [ ]:
x_enc = x_enc.iloc[:,~x_enc.columns.isin(['Examide', 'Citoglipton'])]
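Constant columns can also be found programmatically rather than by reading the heatmap. The following is a minimal sketch on a hypothetical toy frame (the values are invented; only the column names echo the lab data) that detects columns with a single unique value and drops them.

```python
import pandas as pd

# Hypothetical encoded features; "Examide" is constant on purpose
toy = pd.DataFrame({
    "Examide": [0.0, 0.0, 0.0, 0.0],
    "Insulin": [0.0, 1.0, 2.0, 1.0],
    "Metformin": [1.0, 0.0, 1.0, 2.0],
})

# A column with at most one distinct value carries no information
constant_cols = [c for c in toy.columns if toy[c].nunique() <= 1]
print(constant_cols)

# Its correlation with any other column is undefined (NaN)
toy_corr = toy.corr()
print(toy_corr)

# Drop the constant columns by name
toy_reduced = toy.drop(columns=constant_cols)
```

The NaN entries in the correlation matrix are what appear as blank cells on the heatmap.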

## 6. Tasks

In [ ]:
model = DecisionTreeClassifier

 <div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  5: </h1>

<b>Create a user-defined function that calculates the accuracy of a given classifier model.</b>

</div>

In [ ]:
def model_ac(x_enc, y_enc, clf):
    # Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary> 

```python
    model = clf()
    model.fit(x_enc, y_enc)
    yhat = model.predict(x_enc)
    accuracy_train = accuracy_score(y_enc, yhat)
    return accuracy_train
```

</details>

In [ ]:
print('Accuracy: %.2f' % (model_ac(x_enc, y_enc, model) * 100))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  6: </h1>

<b>Create a user-defined function that calculates the feature importances of a given classifier model.</b>

</div>

In [ ]:
def model_imp(x_enc, y_enc, clf):
    # Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>
    
```python
    model = clf()
    model.fit(x_enc, y_enc)
    feat_importances = pd.Series(model.feature_importances_, index=x_enc.columns)
    return feat_importances.sort_values(ascending=False)
```

</details>

In [ ]:
imp = model_imp(x_enc, y_enc, model)
print(imp)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  7: </h1>

<b>Build a plot that shows how the accuracy of the given model depends on the number of input features.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>
    
```python

col = []
ac = []
for c in imp.index:
    col.append(c)
    ac.append(model_ac(x_enc[col], y_enc, model))
    print('Input fields: ', len(col), 'Accuracy: %.2f' % (ac[-1] * 100))
ac = pd.Series(ac)
ac.plot()
```
    
</details>

## 7. Classification models

### Extra Trees Classifier

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  8: </h1>

<b>Build ExtraTreesClassifier model.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>

```python
model = ExtraTreesClassifier()
model.fit(x_enc, y_enc)
```
    
</details>

Apply the model to the data to obtain predictions:

In [ ]:
yhat = model.predict(x_enc)
print(yhat)

Evaluate accuracy:

In [ ]:
accuracy = accuracy_score(y_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
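The accuracy above is measured on the rows the model was fitted on. A common complement is to hold out part of the data with `train_test_split` and compare training and test accuracy. The sketch below does this on synthetic data from `make_classification`; the data, the split ratio and the variable names are assumptions for illustration, not results from the diabetes DataSet.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for x_enc / y_enc
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_classes=3, random_state=0
)

# Hold out 30% of the rows; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

holdout_model = ExtraTreesClassifier(n_estimators=50, random_state=0)
holdout_model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, holdout_model.predict(X_train))
test_acc = accuracy_score(y_test, holdout_model.predict(X_test))
print('Train accuracy: %.2f' % (train_acc * 100))
print('Test accuracy: %.2f' % (test_acc * 100))
```

A large gap between the two numbers suggests that the model has memorized the training rows rather than learned patterns that generalize.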

### Logistic regression

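As you can see, the accuracy of the Extra Trees model is very high. Keep in mind that it was measured on the same data the model was trained on, so it is likely to overstate how well the model generalizes to new patients.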

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

**[Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)** is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model. We will use this model in exactly the same way as the previous one.

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  9: </h1>

<b>Build LogisticRegression model.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>
    
```python
col = imp.nlargest(5).index
model = LogisticRegression(solver='lbfgs')
model.fit(x_enc[col], y_enc)
yhat = model.predict(x_enc[col])
accuracy = accuracy_score(y_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))
```

</details>

As you can see, on this DataSet this method is less accurate.

## 8. Decision tree

### Build model

As shown, the previous methods can reach high accuracy for medical data. However, their biggest drawback is that the decision cannot be visualized or justified.

Decision trees are a popular supervised learning method for a variety of reasons. Benefits of decision trees include that they can be used for both regression and classification, they don’t require feature scaling, and they are relatively easy to interpret as you can visualize decision trees. This is not only a powerful way to understand your model, but also to communicate how your model works. Consequently, it would help to know how to make a visualization based on your model.

A **[Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)** is a supervised algorithm used in machine learning. It uses a binary tree graph (each node has two children) to assign a target value to each data sample. The target values are stored in the tree leaves. To reach a leaf, the sample is propagated through the nodes, starting at the root node. In each node a decision is made about which descendant node the sample should go to. The decision is based on one of the sample's features. Decision Tree learning is the process of finding the optimal rule in each internal tree node according to the selected metric.

This method also allows us to calculate feature importances.
Let's calculate them, choose the 5 best, refit the model and visualize the decision tree.

Using the graph above, select the `max_depth` value that best fits DecisionTreeClassifier().
Correct syntax: DecisionTreeClassifier(max_depth=value)
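One possible way to compare `max_depth` values is to fit one tree per candidate depth and score each on held-out rows. The sketch below is only an illustration under stated assumptions: the data is synthetic, and the candidate depths and split ratio are arbitrary choices rather than values recommended for the diabetes DataSet.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the selected features
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# Fit one tree per candidate depth and record its validation accuracy
depth_scores = {}
for depth in (1, 2, 3, 5, 8):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_tr, y_tr)
    depth_scores[depth] = candidate.score(X_val, y_val)

# Pick the depth with the highest validation accuracy
best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores)
print('Best max_depth:', best_depth)
```

Shallower trees are also easier to read when plotted, so a slightly less accurate but smaller tree can be a reasonable trade-off.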

In [ ]:
print(imp.nlargest(5))

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h1> Question  10: </h1>

<b>Build a DecisionTreeClassifier model and fit it with the most important features from above.</b>

</div>

In [ ]:
# Write your code below and press Shift+Enter to execute

<details><summary>Click <b>here</b> for the solution</summary>
    
```python
model = DecisionTreeClassifier()
X_most_important = x_enc[col]

model.fit(X_most_important, y_enc)
yhat = model.predict(X_most_important)
accuracy = accuracy_score(y_enc, yhat)
print('Accuracy: %.2f' % (accuracy * 100))
```
    
</details>

### Visualization of decision tree

Let's visualize decision tree.
There are some ways to do it.

### _Text visualization_

In [ ]:
text_representation = tree.export_text(model)
print(text_representation)

You can save it into file:

In [ ]:
with open("decision_tree.log", "w") as fout:
    fout.write(text_representation)
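The text output is easier to read when the real feature names are passed to `export_text` through its `feature_names` parameter. Here is a minimal sketch on a hypothetical one-feature dataset (the feature name and values are invented for the example):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: one feature that separates two classes
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Without feature_names the rules refer to feature_0, feature_1, ...
toy_text = export_text(toy_tree, feature_names=["num_medications"])
print(toy_text)
```

Each line of the output is one rule or leaf, indented by its depth in the tree.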

### _Plot tree_

You can plot the tree in two different ways:

**[plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)** (slow render - this can take some time):

In [ ]:
fig = plt.figure(figsize=(100,100))
_ = tree.plot_tree(model,
            max_depth = 7,
            feature_names = list(col),
            # class names must follow the order of the encoded labels (0, 1, 2)
            class_names = list(le.classes_),
            filled = True)

In [ ]:
fig.savefig('decision_tree.png')

Or you can use the **[python-graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html)** library, which renders faster:

In [ ]:
import graphviz
dot_data = tree.export_graphviz(model,
                                max_depth = 7,
                                feature_names = list(col),
                                # class names must follow the order of the encoded labels (0, 1, 2)
                                class_names = list(le.classes_),
                                filled = True)

After creating it, you can draw the graph:

In [ ]:
graph = graphviz.Source(dot_data, format="png")
graph

And render it into file:

In [ ]:
graph.render("decision_tree_graphviz")

Now let's try out our model.

### Select the patient with index 1 and, using the diagram built above, try to predict whether the patient will be readmitted.

Num Medications <= 0.119(0.15) --> false | Diagnosis1 <= 95.5(454) --> false | Diagnosis3 <= 86.5(766) --> false | Diagnosis2 <= 78.5(78) --> true | Diagnosis2 <= 53.5(78) --> false | Num Lab Procedures <= 0.195(0.076) --> true | Num Medications <= 0.144(0.15) --> false ==> **the result is <30, but the Readmitted value in the DataSet is NO. The prediction is wrong because the tree diagram is not fully built (the plot is limited to max_depth = 7), so the prediction is not as precise as it should be.**

### Now select patient with index 0.

Num Medications <= 0.119(0.2) --> false | Diagnosis1 <= 95.5(259) --> false | Diagnosis3 <= 86.5(256) --> false | Diagnosis2 <= 78.5(246) --> false | Num Medications <= 0.469(0.2) --> true | Diagnosis1 <= 338.5(259) --> true | Diagnosis1 <= 275.5(259) --> true ==> **the result is <30, which matches the value in the dataset. So we correctly predicted the patient's value.**
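A manual walk through the diagram can be checked in code. The sketch below uses a hypothetical toy tree, not the lab model: `decision_path` returns the nodes a sample visits, and the fitted tree's `tree_.feature` and `tree_.threshold` arrays give the test applied at each of them.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: feature 0 separates the classes, feature 1 is constant
X_toy = np.array([[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])
toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

sample = X_toy[[3]]  # double brackets keep the 2-D shape: one row, two features
node_ids = toy_tree.decision_path(sample).indices  # visited nodes, root first

for node in node_ids:
    feature = toy_tree.tree_.feature[node]
    if feature < 0:  # a negative feature index marks a leaf
        print(f"node {node}: leaf")
    else:
        threshold = toy_tree.tree_.threshold[node]
        goes_left = sample[0, feature] <= threshold
        print(f"node {node}: feature {feature} <= {threshold:.2f} --> {goes_left}")

predicted = toy_tree.predict(sample)[0]
print("predicted class:", predicted)
```

Because this follows the full tree rather than a plot truncated at a fixed depth, the path always ends in the leaf that produces the model's actual prediction.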

## 9. Conclusions

In this lab we learned how to do preliminary data processing, in particular how to change data types, normalize numerical data and encode categorical data. We saw how to select features using different methods, how to build training and test DataSets, and how to work with different classifiers. We also saw how to visualize a decision tree.
As a result, the lab showed how to predict patient readmission on the basis of a statistical database.

## 10. Author

[Yaroslav Vyklyuk, prof., PhD., DrSc](http://vyklyuk.bukuniver.edu.ua/en/)

 Copyright &copy; 2021 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).