
# Classification on Wisconsin Breast Cancer data (with Python)


In this notebook, we explore some basic capabilities of Python's **scikit-learn** package for working with classification datasets. For numerical analysis of tabular data, we use the Pandas package, which provides data types and functions for working with two-dimensional tables of data in Python. Its central structure, the DataFrame, is a convenient container for tabular data.

*Supervised* machine learning techniques involve training a machine learning model that uses a set of *features* to predict a *label*, based on a dataset in which the label values are already known. This can be mathematically formulated as
$$y = f([x_1, x_2, x_3, ...]),$$
where $f$ represents a function that maps the features to the label.

*Classification* is a form of supervised machine learning in which you train a model to use the features (the ***x*** = $[x_1, x_2, x_3, ...]$ values in our function) to predict a label (***y***). The model calculates the probability that the observed case belongs to each of a number of possible classes and predicts the appropriate label from these probabilities. The simplest form of classification is *binary* classification, in which the label is 0 or 1, representing one of two classes; for example, "True" or "False"; "Risk" or "No-risk"; "Profitable" or "Non-Profitable"; and so on.
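
To make the "probability first, label second" idea concrete, here is a minimal sketch of a binary classifier. The weights, bias and feature values are made up for illustration and are not fitted to any data, and the logistic (sigmoid) function is only one common choice for $f$.

```python
import math


def predict_probability(x, weights, bias):
    """Return P(y = 1 | x) for a linear score passed through a sigmoid."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(x, weights, bias, threshold=0.5):
    """Turn the probability into a 0/1 label by comparing it with a threshold."""
    return 1 if predict_probability(x, weights, bias) > threshold else 0


# Hypothetical weights and features, chosen only for illustration
weights = [0.8, -0.5]
bias = -0.1
x = [1.5, 0.4]

p = predict_probability(x, weights, bias)
label = predict_label(x, weights, bias)
print(f"P(y=1) = {p:.3f}, predicted label = {label}")
```

Training a model amounts to finding weights and a bias that make these probabilities agree with the known labels as well as possible.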

In this tutorial, we shall use the Breast Cancer Wisconsin (Diagnostic) Dataset, extracted from [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data). The goal is to predict whether a patient's tumor is benign or malignant. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image and the 3-dimensional space that it occupies (K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34).

Contents:
- Explore and preprocess data
- Split data into training data and test data
- Train classification models using **scikit-learn** machine learning models
- Save your model and inference new cases.
- Appendix 1: Other file formats

What you will learn:
- Exploring and preprocessing data for training classification machine learning models.
- Exploring the different off-the-shelf classification machine learning models of **scikit-learn**.
- Saving the trained machine learning model and importing it to make new predictions.


Source:
- [Microsoft's ml-basics tutorials](https://github.com/MicrosoftDocs/ml-basics)
- [Pandas documentation](https://pandas.pydata.org/docs/)
- [Matplotlib documentation](https://matplotlib.org/3.3.2/contents.html)
- [User guide of Scikit-learn](https://scikit-learn.org/stable/user_guide.html)

## Explore and preprocess data

Let us import the Breast Cancer Wisconsin Dataset. The dataset is saved in the folder *online-data* in *csv-format*. This is a common data format where the information is delimited using a symbol such as **,** or **;**.

To import this data into Python as a Pandas DataFrame, use the **read_csv** function of the Pandas package. You need to specify which **delimiter** the file uses and whether a **header** is present. The header contains schema information about what the values in each column represent. More information on the Pandas DataFrame can be found in the [Pandas documentation](https://pandas.pydata.org/docs/).

Remark:
- To import data from other file formats, please look into Appendix 1.

In [None]:
import pandas as pd

# load the training dataset
data = pd.read_csv('wisconsin_data.csv', delimiter=',', header='infer')
data

We observe that we have data on 569 patients and that the first two columns contain the following information:
- ID number of the patient
- Cancer Diagnosis (M = malignant, B = benign).

Also, ten real-valued features are computed for each cell nucleus:
- Radius (mean of distances from center to points on the perimeter)
- Texture (standard deviation of gray-scale values)
- Perimeter
- Area
- Smoothness (local variation in radius lengths)
- Compactness (perimeter^2 / area - 1.0)
- Concavity (severity of concave portions of the contour)
- Concave points (number of concave portions of the contour)
- Symmetry
- Fractal dimension ("coastline approximation" - 1).

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.
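
As a quick check of the arithmetic (10 measurements times 3 statistics gives 30 features), the column names can be generated in plain Python. This sketch assumes the Kaggle CSV naming convention with the suffixes `_mean`, `_se` and `_worst`; only the `_mean` names appear explicitly later in this notebook.

```python
# The ten base measurements computed for each cell nucleus
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Assumed suffixes: mean, standard error, and "worst" (mean of the three largest values)
statistics = ["mean", "se", "worst"]

# One column per (statistic, measurement) combination
feature_names = [f"{feature}_{stat}" for stat in statistics for feature in base_features]

print(len(feature_names))
print(feature_names[:3])
```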

The first step in any machine learning project is to explore the data that you will use to train a model. The goal of this exploration is to try to understand the relationships between its attributes; in particular, any apparent correlation between the *features* and the *label* your model will try to predict. This may require some work:
- detecting and fixing issues in the data (such as missing values, errors, or outlier values),
- deriving new feature columns by transforming or combining existing features (a process known as *feature engineering*),
- *normalizing* numeric features (values you can measure or count) so they're on a similar scale,
- and *encoding* categorical features (values that represent discrete categories) as numeric indicators.

For example, we observe that the last column of the dataset contains missing values, the result of a mistake when importing the data. Also, the patient *id* is an arbitrary identifier that carries no information about the diagnosis label. Therefore, we remove the first and last columns in the following code cell, using the **iloc** indexer.

In [None]:
# Remove the first and last column
data2 = data.iloc[:, 1:32]
data2.head()
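
Selecting columns by position works, but it silently depends on the column order. As an alternative sketch, the same cleanup can be written by name. The toy table below is a stand-in for the real data; its column names (including `Unnamed: 32` for the empty last column) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw table (values and column names are illustrative)
raw = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "texture_mean": [10.38, 14.36, 15.70],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Count missing values per column to spot the broken column
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Drop the identifier, then drop every column that contains only missing values
cleaned = raw.drop(columns=["id"]).dropna(axis="columns", how="all")
print(cleaned.columns.tolist())
```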

We can visualize the distribution of the features of the dataset using boxplots, grouped by the value of the label. Let us do this only for the mean values of the features:

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

# Defining the features to visualize
features = ["radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean","compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean"]

# Visualize using the boxplot method of matplotlib, sorted by the label values.
for col in features:
    data2.boxplot(column=col, by='diagnosis', figsize=(6,5))
    plt.title(col)
plt.show()

This is useful information. We observe that *radius_mean*, *texture_mean*, *perimeter_mean*, *area_mean*, *smoothness_mean*, *compactness_mean*, *concavity_mean* and *concave points_mean* show a difference in their distributions between benign and malignant diagnoses (the median of each of these features is higher when the diagnosis is malignant). This suggests that these features are useful for predicting the diagnosis label. For *symmetry_mean* and *fractal_dimension_mean*, this is not so obvious.

This analysis can also be repeated for the omitted features of the dataset.
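
The visual comparison of medians can be backed by numbers with a `groupby`. The sketch below uses a small made-up table rather than the real data, so the values are illustrative only.

```python
import pandas as pd

# Made-up values for illustration; the real notebook would use data2 instead
toy = pd.DataFrame({
    "diagnosis": ["M", "M", "M", "B", "B", "B"],
    "radius_mean": [18.0, 20.5, 17.2, 12.1, 13.4, 11.8],
    "symmetry_mean": [0.19, 0.18, 0.20, 0.18, 0.19, 0.17],
})

# Median of every feature, computed separately for each diagnosis
medians = toy.groupby("diagnosis").median()
print(medians)

# Difference between the malignant and benign medians, per feature
gap = medians.loc["M"] - medians.loc["B"]
print(gap)
```

Because the features live on very different scales, the raw gaps are not directly comparable across features; they are only a first indication of which columns separate the classes.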

## Split data into training data and test data

Let us now prepare the dataset for training a machine learning model. We split the dataset into the features ***X*** and the label ***y*** by locating their column numbers.

In [None]:
# Split by features and label
features_columns = range(1, data2.shape[1])
label_column = 0
X, y = data2.iloc[:, features_columns], data2.iloc[:, label_column]

# Check the results
for i in range(0,4):
    print("Patient", str(i+1), "\n  Features:",list(X.iloc[i]), "\n  Label:", y.iloc[i], "\n")

In machine learning, we construct a training dataset and a test dataset. The training dataset is, as the name suggests, for optimizing the untrained machine learning model into a machine learning model that recognizes the underlying relations and patterns of the dataset. The model is said to be *fitted* to the dataset. However, to evaluate the performance of the trained model, we measure its performance on a separate dataset, the test dataset.

In the **scikit-learn** package, the **train_test_split** function gives us a statistically random split into training and test data. We'll use it to keep 80% of the data for training and hold back 20% for testing.

In [None]:
from sklearn.model_selection import train_test_split

# Split data 80%-20% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

print ('Training cases: %d\nTest cases: %d' % (X_train.shape[0], X_test.shape[0]))
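
A purely random split can leave the two classes in slightly different proportions in the training and test sets. One option, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions similar in both parts. The sketch below uses made-up toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 40% positive labels (made up for illustration)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy asks for the same class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)

print("Positive fraction, train:", y_tr.mean())
print("Positive fraction, test: ", y_te.mean())
```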

## Train classification models using **scikit-learn** machine learning models

One of the machine learning models that can be used for this classification dataset is the off-the-shelf *Logistic Regression* model of the **scikit-learn** package. This model takes a *regularization* parameter *C*. Without diving too much into the mathematics, *C* controls how strongly the model is regularized (in scikit-learn, smaller values of *C* mean stronger regularization), which helps the model generalize and avoid *overfitting* to the training data.

Data scientists refer to values such as *C* as *hyperparameters*: numbers that you have to provide to the model before it trains its *parameters* (the unknown numbers of the model). For simplicity, we shall fix *C* to a chosen number. In practice, it is best to try a few hyperparameter values and keep the one that gives the best trained model. The Logistic Regression model can be *fitted* by calling its **fit** method, like this:

In [None]:
from sklearn.linear_model import LogisticRegression

# Set regularization hyperparameter
C = 100

# train a logistic regression model on the training set
model_log = LogisticRegression(C=C, solver="liblinear").fit(X_train, y_train)
print(model_log)
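
Trying several values of *C* can be automated. The following is a minimal sketch using `GridSearchCV` with 5-fold cross-validation. It makes a few assumptions that are not part of this notebook: it loads the copy of the dataset bundled with scikit-learn (`load_breast_cancer`, where the labels are encoded as 0/1) instead of the CSV file, it adds a `StandardScaler` step, and the grid of candidate values is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset, used here only to keep the sketch self-contained
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scale the features, then fit a logistic regression (scaling is an added assumption)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))

# Candidate values for the regularization hyperparameter C (arbitrary grid)
candidate_C = [0.01, 0.1, 1, 10, 100]
param_grid = {"logisticregression__C": candidate_C}

# 5-fold cross-validation for every candidate, scored by accuracy
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_C = search.best_params_["logisticregression__C"]
print("Best C:", best_C)
print("Mean cross-validated accuracy:", round(search.best_score_, 3))
```

In the notebook itself, the search would be fitted on `X_train` and `y_train` only, so that the test set stays untouched for the final evaluation.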

Now that we have a trained model, we can use it to predict the labels of our test dataset using its **predict** method, like this:

In [None]:
predictions = model_log.predict(X_test)

print('Predicted labels: ', predictions)
print('Actual labels:    ', y_test.to_list())

## Classification metrics

We could compare each predicted label with the actual label one by one, but that would be time-consuming and would not quantify the performance of the trained model. Instead, we use classification metrics such as:
- *Accuracy*: What proportion of the labels did the model predict correctly?
- *Precision*: Of the predictions the model made for this class, what proportion were correct?
- *Recall*: Out of all of the instances of this class in the test dataset, how many did the model identify?
- *F1-Score*: The harmonic mean of precision and recall, which takes both into account.
- *Support*: How many instances of this class are there in the test dataset?

For the accuracy metric, we can use the **accuracy_score** function of the **scikit-learn** package. For the other metrics, we can use the **classification_report** function.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy score: ", accuracy_score(y_test, predictions))
print("Classification report: \n", classification_report(y_test, predictions))

From these numbers we can make some statements, such as:
- "Of all the patients whose diagnosis is benign, 94% are classified correctly by the trained model." (from the recall of the benign diagnosis)
- "Of all the malignant diagnoses predicted by the trained model, 92% are correct." (from the precision of the malignant diagnosis)

These statements are useful when reporting the performance of the trained machine learning model (e.g. machine learning applications in the public health sector).

Mathematically, the precision and recall are calculated using the following quantities:

* *True Positives*: The predicted label and the actual label are both 1.
* *False Positives*: The predicted label is 1, but the actual label is 0.
* *False Negatives*: The predicted label is 0, but the actual label is 1.
* *True Negatives*: The predicted label and the actual label are both 0.
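
With these four counts, precision is TP / (TP + FP), recall is TP / (TP + FN), and the F1-score is the harmonic mean of the two. The sketch below works this out by hand for ten made-up cases, so the arithmetic can be followed without any library.

```python
# Made-up labels for ten cases: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# Count the four outcomes
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)

# Derive the metrics from the counts
accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```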

In our dataset, *1* could correspond to malignant, while *0* to benign. We could visualize these quantities in a *confusion matrix*. This can be easily calculated using the **confusion_matrix** method of the **scikit-learn** package.

In [None]:
from sklearn.metrics import confusion_matrix

# Calculate and display the confusion matrix
m = confusion_matrix(y_test, predictions)
print(m)

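We can also visualize this confusion matrix using the **ConfusionMatrixDisplay** class of the **scikit-learn** package (the older **plot_confusion_matrix** function was removed in scikit-learn 1.2):

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix (requires scikit-learn 1.0 or later)
disp = ConfusionMatrixDisplay.from_estimator(model_log, X_test, y_test)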

We observe that the values of the confusion matrix are color-coded and a corresponding legend is provided.

Statistical machine learning algorithms, such as *LogisticRegression*, work with *probabilities*. The predicted class label is assigned by comparing the predicted probability with a threshold. For example, with a threshold of 0.5, the label 1 is predicted when *P(y) > 0.5* and the label 0 when *P(y) <= 0.5*, where *P(y)* denotes the probability of class 1 returned by the model.

These probabilities can be returned explicitly using the **predict_proba** method of the trained model.

In [None]:
y_prob = model_log.predict_proba(X_test)
print(y_prob)

We observe that for the first test data instance, the trained model predicts a 99.79% probability that it has the label 'malignant'.
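
The effect of the threshold can be sketched in a few lines. The probabilities below are made up (apart from the 99.79% mentioned above) and stand for the "malignant" column of a `predict_proba` output; the helper `labels_at_threshold` is hypothetical and not part of scikit-learn.

```python
def labels_at_threshold(malignant_probabilities, threshold):
    """Assign "M" when the malignant probability exceeds the threshold, else "B"."""
    return ["M" if p > threshold else "B" for p in malignant_probabilities]


# Illustrative malignant probabilities for five cases
malignant_probabilities = [0.9979, 0.42, 0.08, 0.55, 0.31]

default_labels = labels_at_threshold(malignant_probabilities, 0.5)
cautious_labels = labels_at_threshold(malignant_probabilities, 0.3)

print("Threshold 0.5:", default_labels)
print("Threshold 0.3:", cautious_labels)
```

Lowering the threshold flags more cases as malignant, which tends to raise recall for the malignant class at the cost of more false alarms.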

A consequence is that, depending on the chosen threshold, different classes can be assigned in the final predictions. A way to quantify the performance of the trained classification model across all thresholds is a *receiver operating characteristic (ROC) chart*, which plots the *true positive rate* (TPR) against the *false positive rate* (FPR):

$$ TPR = \frac{TP}{TP + FN},$$

$$ FPR = \frac{FP}{FP + TN}.$$

The points of this curve can be computed using the **roc_curve** function of the **scikit-learn** package and plotted with Matplotlib.

In [None]:
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline

# Replace the y_test values into 1s and 0s using a dictionary
di = {"M": 1, "B": 0}
y_test2 = y_test.map(di)

# Calculate the ROC curve using column 1 of y_prob, the probability of class "M" (malignant)
fpr, tpr, thresholds = roc_curve(y_test2, y_prob[:,1])

# Plot ROC curve
fig = plt.figure(figsize=(6, 6))

# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')

# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve -- LogisticRegression')
plt.show()

The *ROC curve* shows the TPR and the FPR for every possible threshold of the trained model. The dashed diagonal line is a reference that represents a model guessing the classes at random (we want to perform better than this). The further the ROC curve bends above this reference line, toward the top-left corner, the better the model. For more info, please visit this [wikipedia page](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).
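
To see where the individual points of such a curve come from, the TPR and FPR formulas can be evaluated by hand at a few thresholds. The labels and scores below are made up, and `roc_point` is a hypothetical helper; a case is counted as positive when its score is at least the threshold.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting 1 for every score >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)


# Made-up labels and predicted probabilities for six cases
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

thresholds_to_try = [1.0, 0.75, 0.5, 0.1]
points = [roc_point(y_true, scores, t) for t in thresholds_to_try]

for t, (fpr_value, tpr_value) in zip(thresholds_to_try, points):
    print(f"threshold={t:.2f}  FPR={fpr_value:.2f}  TPR={tpr_value:.2f}")
```

A very high threshold gives the point (0, 0), a very low one gives (1, 1), and the thresholds in between trace out the curve.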

A performance metric derived from the ROC curve is the area under the curve (AUC), a number between 0 and 1, where 0.5 corresponds to random guessing. The higher this number, the better the trained model performs across thresholds. It can be calculated using the **roc_auc_score** function from **scikit-learn**. We shall use this function to get the AUC and save it in a table for later reference.

In [None]:
from sklearn.metrics import roc_auc_score

# Calculate AUC
auc = roc_auc_score(y_test2,y_prob[:,1])

# Put it in a table for later reference
metrics = pd.DataFrame({"LogisticRegression": [auc]}, index=["AUC"])

metrics
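
One way to build intuition for the AUC is its ranking interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as one half). The sketch below computes this directly on made-up data; `pairwise_auc` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0  # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores for six cases
y_true_demo = [1, 1, 0, 1, 0, 0]
scores_demo = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc_demo = pairwise_auc(y_true_demo, scores_demo)
print(f"AUC = {auc_demo:.3f}")
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.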

## Try other classification models

Now let us try algorithms other than *LogisticRegression*. There are many kinds of classification algorithms we could try, such as:

- *Support Vector Machine (SVM) algorithms*: Algorithms that define a *hyperplane* that separates classes.
- *Tree-based algorithms*: Algorithms that build a decision tree to reach a prediction.
- *Ensemble algorithms*: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.

Let us try a classification model using a Support Vector Machine algorithm. For this algorithm, we must provide a *kernel function*. It can be seen as another hyperparameter that you have to provide to the model before training. For more details, see the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/svm.html).

For simplicity let us train a linear SVM (with a linear kernel), plot the ROC-curve and store its AUC.

In [None]:
from sklearn.svm import LinearSVC

# Set regularization hyperparameter
C = 100

# train a linear SVM model on the training set
model = LinearSVC(C=C).fit(X_train, y_train)
print(model)

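# Calculate AUC and store
# LinearSVC has no predict_proba; decision_function returns signed distances
# to the separating hyperplane, which roc_auc_score accepts as scores
y_prob = model.decision_function(X_test)
auc = roc_auc_score(y_test2, y_prob)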

# Plot ROC
fpr, tpr, thresholds = roc_curve(y_test2, y_prob)
fig = plt.figure(figsize=(6, 6))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve -- LinearSVC')
plt.show()

# Concatenate AUC
m = pd.Series([auc])
m.index = ["AUC"]
metrics = pd.concat([metrics, m.rename("LinearSVC")], axis=1)
metrics
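
`LinearSVC` always uses a linear decision boundary. To experiment with the kernel as a hyperparameter, scikit-learn's `SVC` class accepts a `kernel` argument. The following sketch compares a linear and an RBF kernel; it assumes the copy of the dataset bundled with scikit-learn (labels encoded as 0/1) and adds a `StandardScaler` step, neither of which is part of the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled copy of the dataset, used to keep the sketch self-contained
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

auc_by_kernel = {}
for kernel in ["linear", "rbf"]:
    # Scale the features first, then fit an SVM with the chosen kernel
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    svm.fit(X_tr, y_tr)
    # decision_function returns signed distances that can be used as ROC scores
    svm_scores = svm.decision_function(X_te)
    auc_by_kernel[kernel] = roc_auc_score(y_te, svm_scores)

print(auc_by_kernel)
```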

As an alternative, there's a category of algorithms for machine learning that uses a tree-based approach in which the features in the dataset are examined in a series of evaluations, each of which results in a *branch* in a *decision tree* based on the feature value. At the end of each series of branches are leaf-nodes with the predicted label value based on the feature values.

It's easiest to see how this works with an example. Let us train a Decision Tree classification model using the Wisconsin Breast Cancer data. After training the model, the code below will print the model definition and a text representation of the tree it uses to predict label values.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

# train a Decision Tree Classifier model on the training set
model = DecisionTreeClassifier().fit(X_train, y_train)
print(model, "\n")

# Visualize the model tree
tree = export_text(model)
print(tree)
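
A fully grown tree can be long and hard to read. As a sketch of how to get a more compact printout, the tree depth can be limited with `max_depth` and the real column names passed to `export_text` via `feature_names`. The example assumes the copy of the dataset bundled with scikit-learn; the depth of 2 is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Bundled copy of the dataset, used to keep the sketch self-contained
dataset = load_breast_cancer()

# Limit the depth so the printed tree stays readable (2 is an arbitrary choice)
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(dataset.data, dataset.target)

# Pass the column names so the rules show feature names instead of indices
report = export_text(tree_demo, feature_names=list(dataset.feature_names))
print(report)
```

A shallow tree is easier to interpret, although it may predict less accurately than a deeper one.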

Let us calculate the ROC-curve and store its AUC.

In [None]:
# Calculate AUC and store
y_prob = model.predict_proba(X_test)
auc = roc_auc_score(y_test2, y_prob[:,1])

# Plot ROC
fpr, tpr, thresholds = roc_curve(y_test2, y_prob[:,1])
fig = plt.figure(figsize=(6, 6))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve -- DecisionTreeClassifier')
plt.show()

# Concatenate AUC
m = pd.Series([auc])
m.index = ["AUC"]
metrics = pd.concat([metrics, m.rename("DecisionTreeClassifier")], axis=1)
metrics

Finally, we will repeat the process with a model using an *ensemble* algorithm named *Random Forest* that combines the outputs of multiple random decision trees (for more details, see the [Scikit-Learn documentation](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)).

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Provide the number of estimators (hyperparameter)
n = 100

# train a Random Forest model on the training set
model = RandomForestClassifier(n_estimators=n).fit(X_train, y_train)
print(model)

# Calculate AUC and store
y_prob = model.predict_proba(X_test)
auc = roc_auc_score(y_test2, y_prob[:,1])

# Plot ROC
fpr, tpr, thresholds = roc_curve(y_test2, y_prob[:,1])
fig = plt.figure(figsize=(6, 6))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve -- RandomForestClassifier')
plt.show()

# Concatenate AUC
m = pd.Series([auc])
m.index = ["AUC"]
metrics = pd.concat([metrics, m.rename("RandomForestClassifier")], axis=1)
metrics
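
To illustrate what "combining the outputs of multiple trees" means for scikit-learn's Random Forest, the sketch below checks that the forest's predicted probabilities are the average of the probabilities predicted by its individual trees. It assumes the copy of the dataset bundled with scikit-learn and uses a small forest of 25 trees to keep it quick.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Bundled copy of the dataset, used to keep the sketch self-contained
X_demo, y_demo = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Take a few cases and collect the probabilities predicted by every single tree
sample = X_demo[:5]
tree_probs = np.array([tree.predict_proba(sample) for tree in forest.estimators_])

# Average over the trees and compare with the forest's own output
mean_of_trees = tree_probs.mean(axis=0)
matches_forest = np.allclose(mean_of_trees, forest.predict_proba(sample))
print("Forest probability equals mean of tree probabilities:", matches_forest)
```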

## Save your model and inference new cases

We have tried several machine learning models for our classification dataset.
Let us save the *LogisticRegression* model as a local file.

This can be done like this:

In [None]:
import joblib

# Save the model as a pickle file
filename = 'model.pkl'
joblib.dump(model_log, filename)

print("Model saved!")

Now, we can load it whenever we need it, and use it to predict labels for new data. This is often called *scoring* or *inferencing*.

The scenario might be that the cell nuclei of a new patient have been measured, and we want to predict whether the tumor of the patient is benign or malignant.

In [None]:
import numpy as np

# Load the model from the file
loaded_model = joblib.load(filename)

# Create a numpy array containing the data on measurements of the cell nucleus
X_new = np.array([[19.69, 21.25, 130.0, 1203.0, 0.1096, 0.1599, 0.1974, 0.1279, 0.2069, 0.0599, 0.7456, 0.7869, 4.585, 94.03, 0.00615, 0.0401, 0.03832, 0.02058, 0.0225, 0.004571, 23.57, 25.53, 152.5, 1709.0, 0.1444, 0.4245, 0.4504, 0.243, 0.3613, 0.0873]]).astype('float64')
print ('New sample: {} \n'.format(list(X_new[0])))

# Use the model to predict the tumor type
result = loaded_model.predict(X_new)
print('Prediction tumor type: {} \n'.format(result[0]))
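
A simple sanity check after saving a model is to confirm that the reloaded copy makes exactly the same predictions as the original. The sketch below does this round trip in a temporary directory. It assumes the copy of the dataset bundled with scikit-learn, and the file name `model_demo.pkl` is arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Bundled copy of the dataset, used to keep the sketch self-contained
X_demo, y_demo = load_breast_cancer(return_X_y=True)
model_demo = LogisticRegression(C=100, solver="liblinear").fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.pkl")
    joblib.dump(model_demo, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model_demo.predict(X_demo), restored.predict(X_demo))
print("Restored model reproduces the predictions:", same_predictions)
```

Files written by `joblib` should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.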

## Further Readings

Sources:
- To learn more about Scikit-Learn, see the [Scikit-Learn documentation](https://scikit-learn.org/stable/user_guide.html).
- To learn more about machine learning basics on other datasets, see [Microsoft's ml-basics tutorials](https://github.com/MicrosoftDocs/ml-basics).