# Introduction

Hello! In this notebook I'm going to show you how to use supervised learning to identify whether a breast tumor is benign or malignant. The Data Science Process is shown below, and I'll follow it step by step to reach the result.

![](https://miro.medium.com/max/3870/1*eE8DP4biqtaIK3aIy1S2zA.png)

# Code time:

# 1. Importing libraries

**scikit-learn** is one of the most widely used Python libraries for machine learning. **pandas** is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool,
built on top of the Python programming language.

In [None]:
import sklearn
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# 2. Loading the data


In [None]:
data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

# 3. Cleaning the data

In [None]:
# Checking if there are null values in the dataset (number of nulls per column)
data.isnull().sum()

In [None]:
# Deleting 'Unnamed: 32' column
data.drop("Unnamed: 32",axis=1,inplace=True)

In [None]:
# Deleting 'id' column
data.drop("id",axis=1,inplace=True)

# 4. Exploring the data

In [None]:
# Take a look at the data columns:
list(data.columns)

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
# Mapping diagnosis to integer values
data['diagnosis']=data['diagnosis'].map({'M':1,'B':0})

**The features can be divided into three groups, "mean", "se" and "worst", which is done below:**

In [None]:
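# After dropping 'id' and 'Unnamed: 32', column 0 is 'diagnosis',
# followed by 10 "mean", 10 "se" and 10 "worst" features
features_mean = list(data.columns[1:11])
features_se = list(data.columns[11:21])
features_worst = list(data.columns[21:31])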
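
Positional slices like the ones above are easy to get off by one. As a minimal sketch of an alternative, the groups can be selected by name suffix instead. The column list and the helper `columns_with_suffix` below are hypothetical and only follow the dataset's `<feature>_<group>` naming pattern:

```python
# Hypothetical, shortened column list following the "<feature>_<group>" pattern.
columns = ["diagnosis", "radius_mean", "texture_mean",
           "radius_se", "texture_se", "radius_worst", "texture_worst"]

def columns_with_suffix(names, suffix):
    """Return the column names that end with the given suffix."""
    return [name for name in names if name.endswith(suffix)]

mean_cols = columns_with_suffix(columns, "_mean")
se_cols = columns_with_suffix(columns, "_se")
worst_cols = columns_with_suffix(columns, "_worst")

print(mean_cols)
print(se_cols)
print(worst_cols)
```

With the real DataFrame, the same helper could be called as `columns_with_suffix(data.columns, "_se")`.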

**Getting the frequency of the breast cancer diagnosis:**
* 1 (Malignant)
* 0 (Benign)

In [None]:
sns.set(style='darkgrid', font_scale=1.1)
# Pass the column as x= explicitly; recent seaborn versions require keyword arguments here
sns.countplot(x=data['diagnosis'], label="Count")

## Analyzing data correlation
A correlation matrix is a table showing the correlations between pairs of variables in a dataset.
Each row and column represents a variable, and each value in the matrix is the correlation coefficient between the variables represented by the corresponding row and column.
The correlation matrix is an important data analysis tool: it summarizes the relationships between the variables, so that we can make decisions accordingly.
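
To make the idea concrete, here is a minimal sketch of how a single cell of the matrix can be computed. It uses the Pearson coefficient, which is the default method of pandas' `DataFrame.corr()`. The function name `pearson` and the toy values are illustrative assumptions, not part of the dataset:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values (not taken from the dataset)
radius = [1.0, 2.0, 3.0, 4.0]
perimeter = [2.0, 4.0, 6.0, 8.0]    # grows exactly with radius
smoothness = [4.0, 3.0, 2.0, 1.0]   # shrinks exactly as radius grows

print(pearson(radius, perimeter))   # perfectly positive correlation
print(pearson(radius, smoothness))  # perfectly negative correlation
```

Values close to 1 or -1 mean that two features carry almost the same information, which is why one of them can often be left out.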

In [None]:
corr = data[features_mean].corr()
plt.figure(figsize=(14,14))
sns.heatmap(corr, cbar = True,  square = True, annot=True, fmt= '.2f',annot_kws={'size': 15},
           xticklabels= features_mean, yticklabels= features_mean)

In [None]:
# Based on the correlation heatmap, we can select some of the variables to be used for prediction
pred_var = ['texture_mean','radius_mean','smoothness_mean','concavity_mean','symmetry_mean']

In [None]:
g = sns.PairGrid(data, y_vars=pred_var, x_vars=['diagnosis'], aspect=0.8, height=3.0)
g.map(sns.barplot, palette='muted')

# 5. Creating a model

We need to know how well the model performs. To do this, the data is split into two parts: 1) a training dataset that we use for building the model, and 2) a test dataset that we use for testing the accuracy of our model. We do this with the train_test_split function, which shuffles the dataset randomly and by default extracts 75% of the cases as training data and 25% of the cases as test data.
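
As a quick check of the default split, here is a small sketch on toy data (the arrays `X` and `y` are made up for illustration and are not the breast cancer data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 3 features each, and alternating 0/1 labels.
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

# No test_size is given, so the default 25% test fraction is used.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

print("train:", X_tr.shape, y_tr.shape)
print("test: ", X_te.shape, y_te.shape)
```

A different proportion can be requested with the `test_size` parameter, and `random_state` makes the shuffle reproducible.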

In [None]:
data_target = data['diagnosis']
data_features = data.drop(['diagnosis'],axis=1)

In [None]:
# Splitting our dataset into training data and test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_features, data_target, random_state=0)

In [None]:
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)

In [None]:
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

## Model: K-Nearest Neighbours

KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).

![](https://3.bp.blogspot.com/-In1TiknFHSg/XHaqqP8UzhI/AAAAAAAAGSY/0m6BSNsFKqIEDVJZyhSatsi7jL2Kb4pwwCLcBGAs/s1600/knn.jpg)
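
To show the idea behind KNN, here is a minimal from-scratch sketch. It assumes Euclidean distance and a simple majority vote; `knn_predict` and the toy points are hypothetical, and this is not how scikit-learn implements the algorithm internally:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data: label 0 near the origin, label 1 further out.
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]
labels = [0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.2, 1.4), k=3))
print(knn_predict(points, labels, (5.8, 5.9), k=3))
```

Because the prediction depends on distances, features with large numeric ranges can dominate the result, so scaling the features is often worth trying with KNN.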

### Building a model

In [None]:
# n_neighbors=1 sets the number of nearest neighbours to 1.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
# build the model on the training set, i.e. X_train and y_train.
knn.fit(X_train, y_train)

### Model evaluation
We use the test dataset to evaluate the accuracy of the model.

In [None]:
print("KNN-1 Accuracy on training set:  {:.3f}".format(knn.score(X_train, y_train)))
print("KNN-1 Accuracy on test set: {:.3f}".format(knn.score(X_test, y_test)))

The KNN model with n_neighbors=1 has 100% accuracy on the training dataset. This is expected, because each training point is its own nearest neighbour, and it suggests that the model is over-fitting the training data.

### Testing predictions using the model

In [None]:
# specify one new instance to be predicted
X_new = np.array([[18.99,
10.30,
123.8,
1001,
0.119,
0.26,
0.30,
0.15,
0.24,
0.08,
1.095,
0.9053,
8.65,
157.4,
0.0064,
0.04904,
0.05373,
0.01587,
0.03003,
0.0053,
25.38,
17.33,
186.5,
2019,
0.1642,
0.6656,
0.7119,
0.2654,
0.4601,
0.1189]])

In [None]:
prediction = knn.predict(X_new)

print(f"Prediction: {'Malignant' if prediction == 1 else 'Benign'}")

### Improving the KNN model

Let's try a different number of nearest neighbours (k).
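
Instead of testing one value at a time, several values of k can be compared in a loop. The sketch below uses synthetic stand-in data from `make_classification` (an assumption for illustration, not the breast cancer dataset), so its numbers say nothing about this notebook's results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the breast cancer dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Map each k to its (training accuracy, test accuracy).
scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

With the notebook's own `X_train`, `X_test`, `y_train` and `y_test`, the same loop would show which k gives the best test accuracy.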

In [None]:
knn = KNeighborsClassifier(n_neighbors=4)

In [None]:
knn.fit(X_train, y_train)

In [None]:
print("KNN-4 - Accuracy on training set:  {:.3f}".format(knn.score(X_train, y_train)))
print("KNN-4 - Accuracy on test set: {:.3f}".format(knn.score(X_test, y_test)))

Using n_neighbors=4 instead of 1 improved the accuracy of the model.

In [None]:
prediction = knn.predict(X_new)

print(f"Prediction: {'Malignant' if prediction == 1 else 'Benign'}")

## Model: Decision Tree

This model uses a decision tree to go from observations about an item to conclusions about the item's target value.

![](https://lh4.googleusercontent.com/v9UQUwaQTAXVH90b-Ugyw2_61_uErfYvTBtG-RNRNB_eHUFq9AmAN_2IOdfOETnbXImnQVN-wPC7_YzDgf7urCeyhyx5UZmuSwV8BVsV8VnHxl1KtgpuxDifJ4pLE23ooYXLlnc)

*Basic Decision Tree example*

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

In [None]:
print("Decision Tree - Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Decision Tree - Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

The resulting decision tree has 100% accuracy on the training dataset, which suggests that it is over-fitting the training data.
To avoid overfitting (and hopefully improve the accuracy of the model on test data), we can stop before the entire tree is built. We can do this by setting the maximum depth of the tree.
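
To see what limiting the depth does to the tree itself, here is a small sketch on synthetic stand-in data (`make_classification` is an assumption for illustration, not the breast cancer dataset):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the breast cancer dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One tree grown until every leaf is pure, one stopped at depth 3.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print("full tree:    depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("shallow tree: depth", shallow_tree.get_depth(), "leaves", shallow_tree.get_n_leaves())
print("full tree training accuracy:   ", full_tree.score(X, y))
print("shallow tree training accuracy:", shallow_tree.score(X, y))
```

The unrestricted tree keeps splitting until it fits every training sample, while the shallow tree is limited to at most 8 leaves and therefore has to learn more general rules.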

In [None]:
tree = DecisionTreeClassifier(max_depth=3, random_state=12)
tree.fit(X_train, y_train)

print("Decision Tree - Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Decision Tree - Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

The new decision tree has lower accuracy on the training dataset, but higher accuracy on the test dataset.

In [None]:
prediction = tree.predict(X_new)

print(f"Prediction: {'Malignant' if prediction == 1 else 'Benign'}")

### Improving the Decision Tree model

Let's try a different max_depth value in the decision tree model.

In [None]:
tree = DecisionTreeClassifier(max_depth=2, random_state=12)
tree.fit(X_train, y_train)

print("Decision Tree - Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Decision Tree - Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

The new decision tree has lower accuracy on the training dataset, but higher accuracy on the test dataset. Values of max_depth higher than this give lower accuracies.

## Model: Random Forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.

![](https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png)
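
The "mode of the classes" idea can be sketched in a few lines. The helper `majority_vote` and the list of votes are hypothetical. Note also that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes, so this is a simplified picture:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class among the individual trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one sample (1 = malignant, 0 = benign).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))
```

Because each tree is trained on a different random sample of the data and features, their individual mistakes tend to differ, and combining them usually gives a more stable prediction than a single tree.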


In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=1000, random_state=999, max_depth=3)
forest.fit(X_train, y_train)

In [None]:
print("Random Forest - Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Random Forest - Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))

In [None]:
prediction = forest.predict(X_new)

print(f"Prediction: {'Malignant' if prediction == 1 else 'Benign'}")

# Conclusion

We can conclude that the Random Forest model proved to be the most accurate in the classification of breast cancer.

Thank you! This is a basic notebook covering the stages of a machine learning project. If you liked it, please vote.