<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode_vertical.png" width="300" alt="cognitiveclass.ai logo"  />
</center>

# **Forecasting Breast Cancer Outcomes from Medical Measurements**

# Lab 4. Classification in medicine

The purpose of this lab is to learn how to classify breast cancer patients with machine learning models.

After completing this lab you will be able to:

1. preprocess data (normalize and transform categorical data) and create a DataSet
2. select features
3. classify patients
4. visualize the decision tree of the classification model

## Outline

* Materials and Methods
* General Part
  * Import Libraries
  * Load the DataSet
  * Data preparation
      * Data transformation
      * Encoding and Normalization
  * Features selection
      * Chi-Squared Statistic
      * Mutual Information Statistic
      * Feature Importance
      * Correlation Matrix with Heatmap
  * Decision tree 
      * Build model
      * Visualization of decision tree
  * Classification models
      * Extra Trees Classifier
      * Logistic regression 
* Authors


----

## Materials and Methods

Clinical and genomic data was downloaded from cBioPortal.
https://www.kaggle.com/datasets/gunesevitan/breast-cancer-metabric

> The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database is a Canada-UK Project which contains targeted sequencing data of 2,508 primary breast cancer samples.

Predicting breast cancer is important because early detection can greatly increase the chances of successful treatment and recovery, ultimately improving the overall health and well-being of the patient.

In this lesson, we will try to answer a set of questions that may be relevant when analyzing breast cancer data:

1. What are the most useful Python libraries for classification analysis?
2. How to transform categorical data?
3. How to create a DataSet?
4. How to perform feature selection?
5. How to make, fit, and visualize the classification model?

In addition, we will draw conclusions from the results of our classification analysis to predict the patient's vital status.

[Scikit-learn](https://scikit-learn.org/stable/) (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

In [None]:
!conda install --yes scikit-learn==0.24.2
!conda install --yes python-graphviz

## Import Libraries

Import the libraries needed for this lab. We add some aliases to make the libraries easier to use in our code, set a default figure size for later plots, and suppress warnings.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 6)
# Data transformation
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.preprocessing import MinMaxScaler
# Features Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, mutual_info_classif
# Classifiers
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
import graphviz

## Load the Dataset

We will use the same DataSet as in previous labs, so the next few steps will be the same.

In [None]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0OKHEN/clean_df.csv', index_col=0)
df.head(5)

In [None]:
df.shape

As you can see, the DataSet consists of 34 columns and 2509 rows. The target column is "Patient's Vital Status". We investigated these columns in previous labs.

<details>
<summary><b>Click to see attribute information</b></summary>
Input features (column names):

1. `Patient ID` - Patient ID (object)
2. `Age at Diagnosis` - Age of the patient at diagnosis time (numeric)
3. `Type of Breast Surgery` - Breast cancer surgery type (categorical: `Breast Conserving`, `Mastectomy`)
4. `Cancer Type Detailed` - Detailed Breast cancer types (categorical: `Breast`, `Breast Angiosarcoma`, `Breast Invasive Ductal Carcinoma`, `Breast Invasive Lobular Carcinoma`, `Breast Invasive Mixed Mucinous Carcinoma`, `Breast Mixed Ductal and Lobular Carcinoma`, `Invasive Breast Carcinoma`, `Metaplastic Breast Cancer`)
5. `Cellularity` - Cancer cellularity post-chemotherapy, which refers to the number of tumor cells in the specimen and their arrangement into clusters (categorical: `High`, `Low`, `Moderate`)
6. `Chemotherapy` - Whether or not the patient had chemotherapy as a treatment (yes/no) (boolean)
7. `Pam50 + Claudin-low subtype` - Pam 50: is a tumor profiling test that helps show whether some estrogen receptor-positive (ER-positive) and HER2-negative breast cancers are likely to metastasize (when breast cancer spreads to other organs). (categorical: `Basal`, `Her2`, `LumA`, `LumB`, `NC`, `Normal`, `claudin-low`)
8. `Cohort` - A cohort is a group of subjects who share a defining characteristic (numeric)
9. `ER status measured by IHC` - To assess if estrogen receptors are expressed on cancer cells by using immune-histochemistry (a dye used in pathology that targets specific antigens, if it is there, it will give a color, it is not there, the tissue on the slide will be colored)(categorical: `Positive`, `Negative`)
10. `ER Status` - Cancer cells are positive or negative for estrogen receptors (categorical: `Positive`, `Negative`)
11. `Neoplasm Histologic Grade` - Determined by pathology by looking at the nature of the cells, do they look aggressive or not (It takes a value from 1 to 3) (numeric).
12. `HER2 status measured by SNP6` - To assess if cancer is positive for HER2 or not by using advance molecular techniques (Type of next-generation sequencing) (categorical: `Gain`, `Loss`, `Neutral`, `Undef`)
13. `HER2 Status` - Whether the cancer is positive or negative for HER2 (categorical: `Positive`, `Negative`)
14. `Tumor Other Histologic Subtype` - Type of cancer based on microscopic examination of the cancer tissue (categorical: `Ductal/NST`, `Lobular`, `Medullary`, `Metaplastic`, `Mixed`, `Mucinous`, `Other`, `Tubular/ cribriform`)
15. `Hormone Therapy` - Whether or not the patient had hormonal therapy as a treatment (yes/no) (boolean)
16. `Integrative Cluster` - Molecular subtype of cancer based on some gene expression (categorical: `1`, `2`, `3`, `4ER+`, `4ER-`, `5`, `6`, `7`,  `8`, `9`, `10`)
17. `Primary Tumor Laterality` - Whether it is involving the right breast or the left breast (categorical: `Left`, `Right`)
18. `Lymph nodes examined positive` - Lymph node samples are taken during the surgery to see whether they are involved by cancer (numeric)
19. `Mutation Count` - Number of genes that have relevant mutations (numeric)
20. `Nottingham prognostic index` - It is used to determine the prognosis following surgery for breast cancer. Its value is calculated using three pathological criteria: the size of the tumor; the number of involved lymph nodes; and the grade of the tumor. (numeric)
21. `Oncotree Code` - The OncoTree is an open-source ontology that was developed at Memorial Sloan Kettering Cancer Center (MSK) for standardizing cancer type diagnosis from a clinical perspective by assigning each diagnosis a unique OncoTree code (categorical: `BRCA`, `BREAST`, `IDC`, `ILC`, `IMMC`, `MBC`, `MDLC`, `PBS`)
22. `PR Status` - Cancer cells are positive or negative for progesterone receptors (categorical: `Positive`, `Negative`)
23. `Radio Therapy` - Whether or not the patient had radiotherapy as a treatment (yes/no) (boolean)
24. `3-Gene classifier subtype` - Three Gene classifier subtype (categorical: `ER+/HER2- High Prolif`, `ER+/HER2- Low Prolif`, `ER-/HER2-`, `HER2+`)
25. `Tumor Size` - Tumor size measured by imaging techniques (numeric)
26. `Tumor Stage` - Stage of cancer based on the involvement of surrounding structures, lymph nodes, and distant spread (numeric)
27. `Overall Survival (Years)` - Duration from the time of the intervention to death (numeric)
28. `Relapse Free Status (Years)` - Absence of any signs or symptoms of cancer recurrence or metastasis after a patient has completed treatment for breast cancer. (numeric)
29. `Nottingham prognostic index-binned` - (categorical)
30. `Inferred Menopausal State-Post` - Indicator of whether the patient is post-menopausal (numeric)
31. `Inferred Menopausal State-Pre` - Indicator of whether the patient is pre-menopausal (numeric)
32. `Relapse Free Status-Not Recurred` - Indicator that the cancer has not recurred: absence of any signs or symptoms of cancer recurrence or metastasis after a patient has completed treatment for breast cancer (numeric)
33. `Relapse Free Status-Recurred` - Indicator that the cancer has recurred: signs or symptoms of cancer recurrence or metastasis after a patient has completed treatment for breast cancer (numeric)


Output feature (desired target):

34. `Patient's Vital Status` - Patient's Vital Status (categorical: `Died of Disease`,`Died of Other Causes`, `Living`)
</details>

Our goal is to create a classification model that can predict a Patient's Vital Status. To do this we must analyze and prepare the data for this type of model.

## Data preparation

### Data transformation

First, we should check which data types pandas assigned to the features.

In [None]:
df.info()

As you can see, all categorical features were recognized as objects. We must change their type to "categorical".

In [None]:
col_cat = list(df.select_dtypes(include=['object']).columns)
# Remove the first column "Patient ID"
col_cat = col_cat[1:]
col_cat

Let's convert these columns to the categorical type and check the result.

In [None]:
df.loc[:, col_cat] = df[col_cat].astype('category')
df.info()

To see the unique values of a particular feature (column), we can use:

In [None]:
df['Type of Breast Surgery'].unique()

As noted earlier, the dataset contains 2509 objects (rows), each described by 34 features (columns), including 1 target feature (Patient's Vital Status). Many features, including the target, are categorical. Values of these data types cannot be used for classification directly; we must transform them to int or float.
To do this we can use **[LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)** and **[OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)**. These classes encode categorical values as integers.
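
To make the encoding step concrete, here is a minimal sketch of what `OrdinalEncoder` does. The toy DataFrame below is hypothetical: its column names mirror the lab, but the rows are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame: column names mirror the lab, rows are made up
toy = pd.DataFrame({
    "Cellularity": ["High", "Low", "Moderate", "Low"],
    "ER Status": ["Positive", "Negative", "Positive", "Positive"],
})

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)

# Categories are sorted per column and numbered from 0
print(encoder.categories_)
print(codes)
```

Note that the integer codes follow the sorted order of the category names, not any clinical ordering, so some models may read an order into them that is not really there.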

First, we split the DataSet into input and output (target) DataSets.

In [None]:
X = df.drop(["Patient's Vital Status", "Patient ID"], axis=1) #input columns
y = df["Patient's Vital Status"] #target column


Let's create a list of boolean fields.

In [None]:
col_bool = list(df.select_dtypes(include=['bool']).columns)
col_bool

### Encoding and Normalization

Let's look at the values of the target column.

In [None]:
y.value_counts()

Then create a list of categorical fields and transform their values into int arrays:

In [None]:
col_cat = df.loc[:, ~df.columns.isin(["Patient's Vital Status", "Patient ID"])].select_dtypes(include=['category']).columns
oe = OrdinalEncoder()
oe.fit(X[col_cat])
X_cat_enc = oe.transform(X[col_cat])

In [None]:
X_cat_enc

Then we must transform the arrays back into a DataFrame:

In [None]:
X_cat_enc = pd.DataFrame(X_cat_enc)
X_cat_enc.columns = col_cat
X_cat_enc

Numerical fields can have different scales and can contain negative values. This can lead to rounding errors and exceptions in some methods (for example, the chi-squared test used later requires non-negative inputs). To avoid this, these features must be normalized.
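
As a small worked example of min-max normalization, the sketch below compares the formula `(x - min) / (max - min)` computed by hand with the result of `MinMaxScaler`. The age values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up ages in a single column (scikit-learn expects 2D input)
ages = np.array([[30.0], [45.0], [60.0], [90.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# The same transformation done by MinMaxScaler
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(ages)

print(manual.ravel())
print(scaled.ravel())
```

Both arrays contain the same values: the smallest age maps to 0 and the largest to 1.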

Let's create a list of numerical fields and normalize it using **[MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)**

In [None]:
col_num = df.select_dtypes(include=['int64', 'float64']).columns

scaler = MinMaxScaler(feature_range=(0, 1))
X_num_enc = scaler.fit_transform(X[col_num])

In [None]:
X_num_enc

As in the previous case, transform the resulting arrays back into a DataFrame.

In [None]:
X_num_enc = pd.DataFrame(X_num_enc)
X_num_enc.columns = col_num
X_num_enc

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Question  #1:</h3>

<b>Transform the boolean columns and their values into int arrays: </b>

</div>

In [None]:
# Write your code below and press Shift+Enter to execute

oe = OrdinalEncoder()
oe.fit(X[col_bool])
X_bool_enc = oe.transform(X[col_bool])

<details><summary>Click here for the solution</summary>

```python
oe = OrdinalEncoder()
oe.fit(X[col_bool])
X_bool_enc = oe.transform(X[col_bool])
```

</details>

In [None]:
col_bool

In [None]:
X_bool_enc

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Question  #2:</h3>

<b>Transform the arrays back into a DataFrame: </b>

</div>

In [None]:
# Write your code below and press Shift+Enter to execute

X_bool_enc = pd.DataFrame(X_bool_enc)
X_bool_enc.columns = col_bool
X_bool_enc

<details><summary>Click here for the solution</summary>

```python
X_bool_enc = pd.DataFrame(X_bool_enc)
X_bool_enc.columns = col_bool
X_bool_enc
```

</details>

Then we concatenate these DataFrames into one input DataFrame.

In [None]:
x_enc = pd.concat([X_cat_enc, X_num_enc, X_bool_enc], axis=1)
x_enc

We must apply the same transformation to the target field.

In [None]:
le = LabelEncoder()
le.fit(y)
y_enc = le.transform(y)
y_enc = pd.Series(y_enc)
y_enc.name = y.name  # a Series has a name, not columns

In [None]:
y

In [None]:
y_enc

As you can see, the value 'Died of Disease' was changed to 0, 'Died of Other Causes' to 1, and 'Living' to 2.
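
The mapping between labels and codes does not have to be read off by eye. A minimal sketch, using a short made-up list of labels with the same three classes, shows how to recover it from the fitted `LabelEncoder`:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels with the same three classes as the target column
labels = ["Living", "Died of Disease", "Living", "Died of Other Causes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted class names; the position is the integer code
mapping = {cls: code for code, cls in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

The same `inverse_transform` call can be applied to model predictions to turn them back into readable class names.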

## Features selection

As noted before, the input consists of 32 features. Of course, some of them are more significant for classification than others.

Two popular feature selection techniques can be used for categorical input data and a categorical (class) target variable.

They are:

* Chi-Squared Statistic.
* Mutual Information Statistic.

Let’s take a closer look at each in turn.

To do this we can use **[SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)**

### Chi-Squared Statistic

Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

You can learn more about this statistical test in the tutorial:

[A Gentle Introduction to the Chi-Squared Test for Machine Learning](https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)

The results of this test can be used for feature selection, where those features that are independent of the target variable can be removed from the dataset.

The scikit-learn machine learning library provides an implementation of the chi-squared test in the **[chi2()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2)** function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.

For example, we can define the SelectKBest class to use the chi2() function and select all (or most significant) features.

Apply the SelectKBest class to extract the top 10 features:

In [None]:
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(x_enc, y_enc)
dfscores = pd.DataFrame(fit.scores_)
# Use the columns of x_enc: after encoding, its column order differs from X
dfcolumns = pd.DataFrame(x_enc.columns)

Concatenate the two DataFrames for better visualization:

In [None]:
featureScores = pd.concat([dfcolumns, dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(round(featureScores.nlargest(10,'Score'),2))  #print 10 best features
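
To see what the chi-squared score rewards, here is a small sketch on made-up binary data. Feature 0 is an exact copy of the target, while feature 1 is only weakly related to it. The arrays `X_toy` and `y_toy` are hypothetical and not part of the lab's DataSet.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Made-up data: column 0 equals the target, column 1 is close to noise
X_toy = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

# chi2 returns one score and one p-value per feature
scores, p_values = chi2(X_toy, y_toy)
print(scores)  # feature 0 scores 3.0, feature 1 about 0.33

# Keep only the single best feature and check which one was selected
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)
mask = selector.get_support()
print(mask)
```

`get_support()` returns a boolean mask over the input columns, which is a convenient way to map selected features back to column names.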

### Mutual Information Statistic

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

[You can learn more about mutual information in the following tutorial.](https://machinelearningmastery.com/information-gain-and-mutual-information)
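
For discrete variables, mutual information can be computed directly from label counts. The sketch below uses `sklearn.metrics.mutual_info_score` on made-up labels; note that this is a different function from `mutual_info_classif`, which uses an estimator that also handles continuous features, so the numbers are not expected to match exactly.

```python
import math
from sklearn.metrics import mutual_info_score

# Made-up binary variables
target = [0, 0, 1, 1]
copy_of_target = [0, 0, 1, 1]   # fully determines the target
unrelated = [0, 1, 0, 1]        # statistically independent of the target

mi_dependent = mutual_info_score(target, copy_of_target)
mi_independent = mutual_info_score(target, unrelated)

# Knowing a perfect copy removes all uncertainty: ln(2) nats for a balanced binary target
print(mi_dependent, math.log(2))
# An independent variable removes no uncertainty
print(mi_independent)
```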

The scikit-learn machine learning library provides an implementation of mutual information for feature selection via the **[mutual_info_classif()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif)** function.

Like chi2(), it can be used in the SelectKBest feature selection strategy (and other strategies).

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Question  #3:</h3>

<b>Define the SelectKBest class to use the mutual_info_classif() function and select all (or most significant) features. </b>

</div>

In [None]:
# Write your code below and press Shift+Enter to execute

bestfeatures = SelectKBest(score_func=mutual_info_classif, k=10)
fit = bestfeatures.fit(x_enc, y_enc)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)
featureScores = pd.concat([dfcolumns, dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(round(featureScores.nlargest(10,'Score'),2))  #print 10 best features

<details><summary>Click here for the solution</summary>

```python
bestfeatures = SelectKBest(score_func=mutual_info_classif, k=10)
fit = bestfeatures.fit(x_enc, y_enc)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(x_enc.columns)
featureScores = pd.concat([dfcolumns, dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(round(featureScores.nlargest(10,'Score'),2))  #print 10 best features
```

</details>

As you can see, these two functions select different significant features.

### Feature Importance

You can get the importance of each feature of your DataFrame from the `feature_importances_` attribute of a fitted classification model.
Feature importance gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.
For example:
Feature importance is a built-in attribute of **[Tree Based Classifiers](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)**; we will use the Extra Trees Classifier to extract the top 10 features of the dataset.

Let's create and fit the model:

In [None]:
model = ExtraTreesClassifier()
model.fit(x_enc, y_enc)

Use the built-in `feature_importances_` attribute of tree-based classifiers:

In [None]:
print(model.feature_importances_)

Let's transform it into a Series and plot a graph of feature importances for better visualization

In [None]:
feat_importances = pd.Series(model.feature_importances_, index=x_enc.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
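
Two properties of these scores are worth checking: they are normalized to sum to 1, and they vary from run to run unless `random_state` is fixed. A minimal sketch on synthetic data (generated with `make_classification`, which is an assumption for illustration and not the lab's DataSet):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data with hypothetical feature names f0..f5
X_arr, y_toy = make_classification(n_samples=200, n_features=6,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Fixing random_state makes the importances reproducible
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
total = importances.sum()
print(importances.sort_values(ascending=False))
print(total)
```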

You can see that for the Extra Trees Classifier the feature importances differ from the previous cases. This means that there are no exact rules for feature selection: the importance of features strongly depends on the model.

### Correlation Matrix with Heatmap

Correlation shows how the features are related to each other.
Correlation can be positive (an increase in the value of one feature goes with an increase in the value of the other) or negative (an increase in the value of one feature goes with a decrease in the value of the other).
A heatmap makes it easy to identify which features are most related to each other; we will plot a heatmap of the correlated features using the Seaborn library.

In [None]:
corrmat = x_enc.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
g=sns.heatmap(x_enc[top_corr_features].corr(),annot=True,cmap="RdYlGn")

As you can see, the pairs 'Inferred Menopausal State-Post' / 'Inferred Menopausal State-Pre', 'Relapse Free Status-Not Recurred' / 'Relapse Free Status-Recurred', and 'HER2 status measured by SNP6' / 'HER2 Status' are strongly correlated with each other. Because the fields in each pair are linearly dependent, knowing one of them lets us calculate the other, so one field from each pair should be removed from the calculation. Let's remove 'Inferred Menopausal State-Pre', 'Relapse Free Status-Not Recurred', and 'HER2 Status'. The column list below also leaves out 'ER Status'.
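
Strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The helper `correlated_pairs` below is a hypothetical name introduced for this sketch, the 0.95 threshold is an arbitrary choice, and the toy data is made up.

```python
import numpy as np
import pandas as pd

# Made-up data: the two menopausal indicators are exact complements
toy = pd.DataFrame({
    "Menopausal State-Post": [1, 0, 1, 1, 0],
    "Menopausal State-Pre": [0, 1, 0, 0, 1],
    "Age at Diagnosis": [50, 60, 41, 70, 45],
})

def correlated_pairs(frame, threshold=0.95):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(row, col)
            for row in upper.index
            for col in upper.columns
            if upper.loc[row, col] > threshold]

pairs = correlated_pairs(toy)
print(pairs)
```

One column from each reported pair is a candidate for removal.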

In [None]:
col = x_enc.columns
col

In [None]:
col = ['Type of Breast Surgery', 'Cancer Type Detailed', 'Cellularity',
       'Pam50 + Claudin-low subtype', 'ER status measured by IHC',
       'HER2 status measured by SNP6', 'Tumor Other Histologic Subtype',
       'Integrative Cluster', 'Primary Tumor Laterality', 'Oncotree Code',
       'PR Status', '3-Gene classifier subtype',
       'Nottingham prognostic index-binned', 'Age at Diagnosis',
       'Cohort', 'Neoplasm Histologic Grade',
       'Lymph nodes examined positive', 'Mutation Count',
       'Nottingham prognostic index', 'Tumor Size', 'Tumor Stage',
       'Overall Survival (Years)', 'Relapse Free Status (Years)',
       'Inferred Menopausal State-Post', 'Relapse Free Status-Recurred',
       'Chemotherapy', 'Hormone Therapy', 'Radio Therapy']

In [None]:
x_enc = x_enc[col]

In [None]:
x_enc

In [None]:
col

## Classification models

## Decision tree

### Build model

The biggest drawback of many classification models is the inability to visualize or justify their decisions.

Decision trees are a popular supervised learning method for a variety of reasons. The benefits of decision trees include that they can be used for both regression and classification, they don’t require feature scaling, and they are relatively easy to interpret as you can visualize decision trees. This is not only a powerful way to understand your model, but also to communicate how your model works. Consequently, it would help to know how to make a visualization based on your model.

A **[Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)** is a supervised algorithm used in machine learning. It uses a binary tree graph (each node has two children) to assign a target value to each data sample. The target values are stored in the tree leaves. To reach a leaf, the sample is propagated through the nodes, starting at the root node. In each node, a decision is made about which descendant node the sample should go to. The decision is based on the features of the sample. Decision Tree learning is the process of finding the optimal rules in each internal tree node according to the selected metric.
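
As an illustration of a "selected metric", the sketch below computes Gini impurity, the default criterion of `DecisionTreeClassifier`, for a made-up node and for one candidate split. The helper names `gini` and `split_impurity` and the toy arrays are assumptions introduced for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity of the two children of a split."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up node with six samples
recurred = np.array([0, 0, 0, 1, 1, 1])
status = np.array(["Living", "Living", "Living", "Died", "Died", "Living"])

parent = gini(status)                          # 4 Living, 2 Died -> 4/9
after = split_impurity(recurred, status, 0.5)  # left child is pure -> 2/9
print(parent, after)
```

The tree learner evaluates many candidate thresholds like this one and keeps the split that reduces the impurity the most.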

Let's calculate the feature importance, choose the most important features, refit the model, and visualize the decision tree step by step.

In [None]:
model_dec = DecisionTreeClassifier()
model_dec.fit(x_enc, y_enc)
yhat = model_dec.predict(x_enc)
accuracy = accuracy_score(y_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Let's create a helper function that calculates the accuracy of a given classifier model.

In [None]:
def model_ac(x, y, clf):
    # Fit the classifier passed in as `clf` and score it on the same data
    clf.fit(x, y)
    yhat = clf.predict(x)
    accuracy = accuracy_score(y, yhat)
    return accuracy

Now, we need to create a helper function that returns the feature importances of a given fitted classifier model.

In [None]:
def model_imp(x, y, model_dec):
    feat_importances = pd.Series(model_dec.feature_importances_, index=x.columns)
    return feat_importances.sort_values(ascending=False)

We can see features sorted by importance in descending order.

In [None]:
imp = model_imp(x_enc, y_enc, model_dec)
print(imp)

Plot graph of feature importances for better visualization

In [None]:
imp.nlargest(10).plot(kind='barh')
plt.show()

Let's build a plot that shows the accuracy of a defined model as a function of the number of input features

In [None]:
col = []
ac = []
for c in imp.index:
    col.append(c)
    ac.append(model_ac(x_enc[col], y_enc, model_dec))
    print('Input fields: ', len(col), 'Accuracy: %.2f' % (ac[-1]*100))
ac = pd.DataFrame(ac)
ac.plot()

We can see that 3 features are enough to reach 100% accuracy on the data the model was fitted on. So let's create a list of these 3 features in order to use them for our next classification models.

In [None]:
col = imp.nlargest(3).index
col

Let's refit the model on the most important features.

In [None]:
X_most_imp = x_enc[col]
model_dec.fit(X_most_imp, y)
yhat = model_dec.predict(X_most_imp)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))
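
The accuracy above is computed on the same rows that were used for fitting, so it can overstate how well the tree generalizes. A minimal sketch of a hold-out check with `train_test_split` (already imported in this lab) is shown below. The data here is synthetic, generated with `make_classification`; that is an assumption for illustration, not the lab's DataSet.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the encoded features
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=3, n_redundant=0,
                                   n_classes=3, random_state=0)

# Hold out 30% of the rows; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))
print('Train accuracy: %.2f' % (train_acc * 100))
print('Test accuracy: %.2f' % (test_acc * 100))
```

An unrestricted tree usually fits its training rows perfectly, so the test accuracy is the more informative number.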

### Extra Trees Classifier

Let's create and fit an ExtraTreesClassifier on the selected features and calculate the classification accuracy:

In [None]:
model = ExtraTreesClassifier()
model.fit(X_most_imp, y_enc)

Use the fitted model to obtain predictions for the same data:

In [None]:
yhat = model.predict(X_most_imp)
print(yhat)

Evaluate accuracy: 

In [None]:
accuracy = accuracy_score(y_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))

In [None]:
imp = model_imp(X_most_imp, y, model)
print(imp)

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Question  #4:</h3>

<b>Build a plot that shows the accuracy of a defined model as a function of the number of input features (using most important features) </b>

</div>

In [None]:
# Write your code below and press Shift+Enter to execute

col = []
ac = []
for c in imp.index:
    col.append(c)
    ac.append(model_ac(X_most_imp[col], y_enc, model))
    print('Input fields: ', len(col), 'Accuracy: %.2f' % (ac[-1]*100))
ac = pd.Series(ac)
ac.plot()

<details><summary>Click here for the solution</summary>

```python
col = []
ac = []
for c in imp.index:
    col.append(c)
    ac.append(model_ac(X_most_imp[col], y_enc, model))
    print('Input fields: ', len(col), 'Accuracy: %.2f' % (ac[-1]*100))
ac = pd.Series(ac)
ac.plot()
```

</details>

### Logistic regression 

As you can see, the accuracy of the Extra Trees model is very good.

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

**[Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)** is a good model for testing feature selection methods, as it can perform better if irrelevant features are removed from the model. We will use this model in exactly the same way as the previous one.
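
One possible way to compare numbers of selected features is to wrap scaling, selection, and the classifier in a `Pipeline` and score each variant with cross-validation. This is a sketch under stated assumptions: the data is synthetic (`make_classification`), and the candidate values in `ks` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the encoded features
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

ks = (2, 5, 10)  # arbitrary candidate numbers of features
results = {}
for k in ks:
    pipe = Pipeline([
        ("scale", MinMaxScaler()),             # chi2 needs non-negative inputs
        ("select", SelectKBest(score_func=chi2, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # 5-fold cross-validated accuracy; selection is refit inside each fold
    results[k] = cross_val_score(pipe, X_toy, y_toy, cv=5).mean()

for k, score in results.items():
    print('Features: %d  Accuracy: %.2f' % (k, score * 100))
```

Putting the selector inside the pipeline means it only sees the training part of each fold, which avoids leaking information from the held-out rows.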

In [None]:
model = LogisticRegression(solver='lbfgs')
model.fit(X_most_imp, y_enc)
yhat = model.predict(X_most_imp)
accuracy = accuracy_score(y_enc, yhat)
print('Accuracy: %.2f' % (accuracy*100))

As we can see, the accuracy of the Logistic Regression model is lower (about 72%).

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Question  #5:</h3>
    
<b>Calculate the accuracy of the Logistic Regression model using all features</b>

</div>

In [None]:
# Write your code below and press Shift+Enter to execute

model = LogisticRegression(solver='lbfgs')
model.fit(x_enc, y)
yhat = model.predict(x_enc)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))

### Visualization of decision tree

Let's visualize the decision tree.
There are some ways to do it. 

With many features the full decision tree becomes too large to read, so we limit its depth by setting max_depth = 3.

In [None]:
model_dec = DecisionTreeClassifier(max_depth = 3)
model_dec.fit(X_most_imp, y)
yhat = model_dec.predict(X_most_imp)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Accuracy is quite high.

### _Text visualization_

In [None]:
text_representation = tree.export_text(model_dec, feature_names=list(X_most_imp.columns))
print(text_representation)

You can save it into the file:

In [None]:
with open("decision_tree.log", "w") as fout:
    fout.write(text_representation)

### _Plot tree_

You can plot a tree in two different ways:

**[plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)** (slow render - this can take some time): 


In [None]:
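fig = plt.figure(figsize=(25,20))
# Take the names from the fitted data: `col` was reassigned in an earlier cell
_ = tree.plot_tree(model_dec,
               feature_names = list(X_most_imp.columns),
               filled = True)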

In [None]:
fig.savefig('decision_tree.png')

Or you can use the **[python-graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html)** library, which renders faster.

In [None]:
dot_data = tree.export_graphviz(model_dec,
               feature_names = list(X_most_imp.columns),
               filled=True)

After creation, you can draw a graph

In [None]:
graph = graphviz.Source(dot_data, format="png") 
graph

We can see that the decision tree shows normalized and encoded values, which are hard to interpret. So let's rebuild the decision tree on the original data.

In [None]:
model_real = DecisionTreeClassifier(max_depth = 3)
X_most_imp_real = df[X_most_imp.columns]

model_real.fit(X_most_imp_real, y)
yhat = model_real.predict(X_most_imp_real)
accuracy = accuracy_score(y, yhat)
print('Accuracy: %.2f' % (accuracy*100))

The accuracy is the same.

In [None]:
text_representation = tree.export_text(model_real, feature_names=list(X_most_imp_real.columns))
print(text_representation)

Save to a file

In [None]:
with open("decision_tree.log", "w") as fout:
    fout.write(text_representation)

We build a decision tree using **[plot_tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)**

In [None]:
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(model_real,
               feature_names = list(X_most_imp_real.columns),
               filled = True)

In [None]:
fig.savefig('decision_tree.png')

Now build a decision tree using **[python-graphviz](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html)**

In [None]:
dot_data = tree.export_graphviz(model_real,
               feature_names = list(X_most_imp_real.columns),
               filled=True)

Let's draw a graph

In [None]:
graph = graphviz.Source(dot_data, format="png") 
graph

Now, let's test our decision tree.

For the test, we will take three patients with indices 1, 28, and 40:

Patient with index 1: Relapse Free Status-Recurred <= 0.50 - TRUE | Age at Diagnosis <= 64.01 - TRUE | Age at Diagnosis <= 53.76 - TRUE -> Result is **Living**; in the DataSet, the Patient's Vital Status value is also **Living**.

Patient with index 28: Relapse Free Status-Recurred <= 0.50 - FALSE | Overall Survival (Years) <= 10.43 - TRUE | Age at Diagnosis <= 59.26 - TRUE -> Result is **Died of Disease**; in the DataSet, the Patient's Vital Status value is also **Died of Disease**.

Patient with index 40: Relapse Free Status-Recurred <= 0.50 - FALSE | Overall Survival (Years) <= 10.43 - TRUE | Age at Diagnosis <= 59.26 - TRUE -> Result is **Died of Disease**; in the DataSet, the Patient's Vital Status value is also **Died of Disease**.
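
The same kind of check can be done in code by asking a fitted tree which nodes a sample visits. The sketch below uses `decision_path` and `apply` on a small tree trained on synthetic data (`make_classification`); the data and the variable names are assumptions for illustration, not the lab's patients.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the three selected features
X_toy, y_toy = make_classification(n_samples=100, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

sample = X_toy[[0]]                               # one row, kept 2D
node_ids = clf.decision_path(sample).indices      # nodes visited, root first
leaf_id = clf.apply(sample)[0]                    # id of the final leaf

for node in node_ids:
    if node == leaf_id:
        print('leaf %d -> class %d' % (node, clf.predict(sample)[0]))
    else:
        feat = clf.tree_.feature[node]
        thr = clf.tree_.threshold[node]
        answer = sample[0, feat] <= thr
        print('node %d: feature %d <= %.2f - %s' % (node, feat, thr, answer))
```

Each printed line corresponds to one test in the manual trace: a TRUE answer follows the left branch and a FALSE answer follows the right one.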

And render it into the file:

In [None]:
graph.render("decision_tree_graphviz")

Save the new csv:

In [None]:
df = df.drop(["Patient ID", "ER Status", "Inferred Menopausal State-Pre", "Relapse Free Status-Recurred", "HER2 Status"], axis=1)
df.to_csv('breast_cancer.csv', index=False)

## Conclusions

In this lab, we learned how to do preliminary data processing: in particular, how to change data types, encode categorical data, and normalize numerical data. We saw how to select features by different methods, how to work with different classifiers, and how to visualize a decision tree.
As a result, we showed how to predict a Patient's Vital Status from a statistical database.

The Decision Tree and Extra Trees classifiers reached 100% accuracy on the data they were fitted on. Because this accuracy was measured on the same data used for fitting, it does not by itself show how well the models classify unseen patients. The accuracy of the Logistic Regression model was lower, at about 72% on the three selected features.

### Thank you for completing this lab!

## Author

<a href="https://author.skills.network/instructors/dmytro_shliakhovskyi">Dmytro Shliakhovskyi</a>

### Other Contributors

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/nataliya_boyko">Ass. Prof. Nataliya Boyko, PhD</a>


## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                                         |
| ----------------- | ------- | ---------- | ---------------------------------------------------------- |
|    2023-03-18     | 01 | Dmytro Shliakhovskyi | Lab created |



<hr>

## <h3 align="center"> © IBM Corporation 2020. All rights reserved. </h3>