# Breast Cancer Detection with Decision Trees

A decision tree is a logic structure composed of connected nodes. Data enters at the root of the tree, and a logical decision is made about that data at each node. Decision trees are a simple, general concept that is easy to grasp, which is a benefit for operational use.

By making our way down the decision tree, we can classify data. In the diagram below, the input data are "job offers," and we are classifying each offer as "decline" or "accept."

![](https://i0.wp.com/dataaspirant.com/wp-content/uploads/2017/01/B03905_05_01-compressor.png?resize=768%2C424&ssl=1)


Everything starts at the top of the decision tree, at the root node.

Each of the internal decision nodes asks a question about the data and determines which branch to continue down. Notice that all decisions must result in a logical "yes" or "no."
For example, instead of asking how long a commute is, ask whether it is over 1 hour.
Each branch line denotes the outcome of a test (yes/no).
Finally, the data reaches an end, or a leaf node, which contains a classification label.
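
The walk from root to leaf can be sketched as a chain of yes/no questions in plain Python. The function name, the questions, and the thresholds here are hypothetical and only illustrate the structure; they are not learned from any data.

```python
def classify_offer(salary, commute_hours, free_coffee):
    """Walk a tiny hand-written decision tree for a job offer.

    All questions and thresholds are made up for illustration.
    """
    # Root node: is the salary at least 50,000?
    if salary < 50_000:
        return "decline"          # leaf node
    # Internal node: is the commute over 1 hour?
    if commute_hours > 1:
        return "decline"          # leaf node
    # Internal node: does the job offer free coffee?
    if not free_coffee:
        return "decline"          # leaf node
    return "accept"               # leaf node


print(classify_offer(salary=60_000, commute_hours=0.5, free_coffee=True))
print(classify_offer(salary=60_000, commute_hours=1.5, free_coffee=True))
```

Each `if` is one decision node, and each `return` is a leaf carrying a classification label. A trained tree has the same shape, except that the questions and thresholds are chosen from the training data.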



![](https://miro.medium.com/max/2400/1*UCd6KrmBxpzUpWt3bnoKEA.png)

**Overfitting** is when a machine learning algorithm has become too specific to the training set. For example, a decision tree could have so many nodes that each piece of data in the training set ends at a different leaf. Unless each new datum is identical to one in the training set, such a tree would likely give inaccurate results. While not always to this extreme, decision trees have a tendency to overfit because of the granular nature of decision nodes.

**Feature selection** is the focus of this notebook!
Other important parameters of decision trees include the maximum depth and the minimum number of samples per leaf.
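
To see what maximum depth and minimum samples per leaf do, here is a minimal sketch that compares a fully grown tree with a restricted one. As an assumption, it uses scikit-learn's bundled copy of the Wisconsin breast cancer data (`load_breast_cancer`) instead of the Kaggle CSV, and the parameter values (`max_depth=3`, `min_samples_leaf=5`) are arbitrary picks for illustration, not tuned settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the Kaggle CSV
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fully grown tree: keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Restricted tree: at most 3 levels, and every leaf keeps at least 5 samples
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_tr, y_tr)

for name, t in [("full", full_tree), ("restricted", small_tree)]:
    print(
        name,
        "| depth:", t.get_depth(),
        "| leaves:", t.get_n_leaves(),
        "| train accuracy:", round(t.score(X_tr, y_tr), 3),
        "| test accuracy:", round(t.score(X_te, y_te), 3),
    )
```

A large gap between training and test accuracy is the usual sign of overfitting. Limiting the depth or the leaf size trades some training accuracy for a simpler tree.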

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns # data visualization library  
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
data.head()

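Get rid of unnecessary columns
Looks like we can drop id and Unnamed: 32 from the features. diagnosis also comes out of the features, because it is the label we want to predict; we keep it as y.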

In [None]:
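y = data.diagnosis                          # target labels: M (malignant) or B (benign)
drop_cols = ['Unnamed: 32', 'id', 'diagnosis']  # columns that should not be used as features (avoid naming this `list`, which shadows the built-in)
x = data.drop(drop_cols, axis=1)  # feature matrix without those columns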
x.head()

# Data visualization
We're gonna skip other data visualization and only plot a heatmap, which visualizes the correlation between features.

In [None]:
#correlation map
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)


# Time for Decision Trees!

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score

# split data train 70 % and test 30 %
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Univariate feature selection and decision tree classification

Instead of hand picking features, we're gonna use a sklearn class called 'SelectKBest' to do it for us statistically. It scores each feature against the target (here with the chi-squared statistic, chi2) and keeps the k highest-scoring features. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest

There are other ways, like feature selection with correlation, recursive feature elimination, tree-based feature selection, principal component analysis (PCA), etc.
@DATAI's notebook on "Feature Selection and Data Visualization" for this dataset is helpful for reference: https://www.kaggle.com/kanncaa1/feature-selection-and-data-visualization#Feature-Selection-and-Random-Forest-Classification
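
To get a feel for what a chi-squared feature score measures, here is a simplified pure-Python sketch on made-up numbers. The helper names `chi2_scores` and `top_k` and the toy data are hypothetical, and this is an illustration of the idea rather than scikit-learn's actual implementation. It assumes non-negative feature values with non-zero column totals (scikit-learn's `chi2` also requires non-negative features).

```python
def chi2_scores(rows, labels):
    """Score each feature by how unevenly its total is spread across classes."""
    classes = sorted(set(labels))
    n_samples = len(rows)
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in rows)
        score = 0.0
        for c in classes:
            # Observed: how much of this feature falls in class c
            observed = sum(row[j] for row, lab in zip(rows, labels) if lab == c)
            # Expected: the share class c would get if the feature ignored the label
            class_share = labels.count(c) / n_samples
            expected = class_share * feature_total
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 differs between classes, feature 1 is constant
toy_rows = [[10, 5], [12, 5], [1, 5], [2, 5]]
toy_labels = ["M", "M", "B", "B"]

toy_scores = chi2_scores(toy_rows, toy_labels)
print(toy_scores)
print(top_k(toy_scores, k=1))
```

The feature whose values differ between the two classes gets a high score, while the constant feature scores zero, so keeping the top k scores keeps the features that carry the most class information under this test.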

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# find best scored 5 features
select_feature = SelectKBest(chi2, k=5).fit(x_train, y_train)
print(x_train.shape)

In [None]:
x_train = select_feature.transform(x_train)
x_test = select_feature.transform(x_test)
print(x_train.shape)


# decision tree classifier with default parameters
model = DecisionTreeClassifier()      
model = model.fit(x_train,y_train)

# Model evaluation
![](https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png)


A confusion matrix helps visualize your model's predictions against the actual diagnoses.

In [None]:
#print accuracy
ac_2 = accuracy_score(y_test,model.predict(x_test))
print('Accuracy is: ',ac_2)

#print confusion matrix
cm_2 = confusion_matrix(y_test,model.predict(x_test))
sns.heatmap(cm_2,annot=True,fmt="d")

You want to maximize the diagonal from top left to bottom right, which holds the correct predictions: the true positives and true negatives (predicted malignant when there is breast cancer and benign when there isn't).
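
To make the diagonal concrete, here is a small pure-Python sketch that builds a confusion matrix by hand. The labels and predictions are made up for illustration, and `confusion_counts` is a hypothetical helper; it follows the same layout as sklearn's `confusion_matrix` (rows are actual labels, columns are predicted labels).

```python
def confusion_counts(actual, predicted, labels):
    """Count (actual, predicted) pairs: rows are actual, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Made-up diagnoses and predictions (B = benign, M = malignant)
toy_actual    = ["M", "M", "M", "B", "B", "B", "B", "B"]
toy_predicted = ["M", "M", "B", "B", "B", "B", "M", "B"]

toy_cm = confusion_counts(toy_actual, toy_predicted, labels=["B", "M"])

# Accuracy is the diagonal (correct predictions) divided by all predictions
correct = sum(toy_cm[i][i] for i in range(len(toy_cm)))
toy_accuracy = correct / len(toy_actual)

print(toy_cm)        # [[4, 1], [1, 2]]
print(toy_accuracy)  # 0.75
```

The off-diagonal cells are the mistakes: here one benign case was predicted malignant and one malignant case was predicted benign.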

# Visualize your decision tree

In [None]:
import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(model, out_file=None, filled=True)

graph = graphviz.Source(dot_data)
graph
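
If graphviz isn't available, a text rendering is one alternative. This is a minimal sketch using sklearn's `export_text`. As an assumption, it trains a small demo tree on scikit-learn's bundled copy of the dataset (`load_breast_cancer`) so the cell runs on its own, and `max_depth=2` is only there to keep the printout short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the bundled dataset stands in for the Kaggle CSV
demo = load_breast_cancer()
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(demo.data, demo.target)

# One line per node; indentation shows how deep the node sits in the tree
tree_text = export_text(demo_tree, feature_names=list(demo.feature_names))
print(tree_text)
```

Each printed line is either a test on a feature (a decision node) or a `class:` line (a leaf).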

# Pros and Cons of Decision Trees

Pros:
Easy to understand and interpret, perfect for visual representation.
Can work with numerical and categorical features (in scikit-learn, categorical features need to be encoded as numbers first).

Cons:
Tends to overfit the training data.

You can explore other parameters of DecisionTreeClassifier here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In addition, try exploring RandomForestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
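
As a starting point for that exploration, here is a minimal sketch of a random forest, which combines many decision trees trained on bootstrap samples of the training data. As an assumption, it uses scikit-learn's bundled copy of the dataset (`load_breast_cancer`) rather than the Kaggle CSV and all features rather than the 5 selected above. The variable names and `n_estimators=100` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the bundled dataset stands in for the Kaggle CSV
X_rf, y_rf = load_breast_cancer(return_X_y=True)
X_rf_train, X_rf_test, y_rf_train, y_rf_test = train_test_split(
    X_rf, y_rf, test_size=0.3, random_state=42
)

# 100 trees, each fit on a bootstrap sample of the training set
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_rf_train, y_rf_train)

rf_accuracy = forest.score(X_rf_test, y_rf_test)
print("Random forest accuracy is:", rf_accuracy)
```

Because the forest combines the votes of many trees, it is usually less prone to overfitting than a single fully grown tree, at the cost of being harder to visualize.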