# Breast Cancer

## A. Problem Understanding

Despite a great deal of public awareness and scientific research, breast cancer continues to be the most common cancer and the second leading cause of cancer deaths among women. Approximately 12% of U.S. women will be diagnosed with breast cancer, and 3.5% will die of it. The annual mortality rate of approximately 28 deaths per 100,000 women has remained nearly constant over the past 20 years. A breast cancer victim’s chances for long-term survival are improved by early detection of the disease, and early detection is in turn enhanced by an accurate diagnosis. The goal here is to classify each patient's tumor as malignant or benign so that appropriate treatment can be given.

## B. Data Understanding

First, a sample of fluid is taken from the patient’s breast. This outpatient procedure involves using a small-gauge needle to take the fluid, known as a fine needle aspirate (FNA), directly from a breast lump or mass, the lump having been previously detected by self-examination and/or mammography. The fluid from the FNA is placed on a glass slide and stained to highlight the nuclei of the constituent cells. An image from the FNA is transferred to a workstation by a video camera mounted on a microscope.

Xcyt uses a curve-fitting program to determine the exact boundaries of the nuclei. The boundaries are initialized by an operator using a mouse pointer. For a typical image containing between 10 and 40 nuclei, the image analysis process takes approximately two to five minutes. Ten features are computed for each nucleus: area, radius, perimeter, symmetry, number and size of concavities, fractal dimension (of the boundary), compactness, smoothness (local variation of radial segments), and texture (variance of gray levels inside the boundary). The mean value, extreme value (i.e., largest or worst value: biggest size, most irregular shape) and standard error of each of these cellular features are computed for each image, resulting in a total of 30 real-valued features.
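
To make the three per-image summaries concrete, here is a minimal sketch for a single feature. The radius values are made up, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of nuclei; the description above does not spell out the exact formula.

```python
import math
import statistics

def summarize_feature(values):
    """Return (mean, standard error, extreme value) of one feature over the nuclei of an image."""
    n = len(values)
    mean = statistics.fmean(values)
    # Assumption: standard error = sample standard deviation / sqrt(n)
    std_error = statistics.stdev(values) / math.sqrt(n)
    # Extreme value taken as the largest measurement
    extreme = max(values)
    return mean, std_error, extreme

# Hypothetical radii (arbitrary units) of five nuclei in one image
radii = [10.0, 12.0, 11.0, 15.0, 12.0]
mean_radius, se_radius, worst_radius = summarize_feature(radii)
print(mean_radius, round(se_radius, 4), worst_radius)
```

Repeating this for all ten features gives the 30 values stored for each image.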

#### 1. Data Description

**data.csv**

1. ID number
2. Diagnosis (M = malignant, B = benign)
3. Ten real-valued features are computed for each cell nucleus:


- radius (mean of distances from center to points on the perimeter) 
- texture (standard deviation of gray-scale values) 
- perimeter 
- area 
- smoothness (local variation in radius lengths) 
- compactness (perimeter^2 / area - 1.0) 
- concavity (severity of concave portions of the contour) 
- concave points (number of concave portions of the contour) 
- symmetry 
- fractal dimension ("coastline approximation" - 1)

Note: The mean, standard error (SE), and worst (mean of the three largest values) of these features are computed for each image, resulting in 30 features. For example, the third column is Mean Radius, column 13 is Radius SE, and column 23 is Worst Radius. All feature values are stored with four significant digits.
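
The column layout described in the note can be reproduced with a short sketch. The names follow the column names used later in this notebook (for example `radius_mean` and `concave points_mean`), and the 1-based positions assume that the ID and the diagnosis occupy columns 1 and 2.

```python
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
statistic_suffixes = ["mean", "se", "worst"]

# Columns 1 and 2 are the ID and the diagnosis; the 30 features follow,
# grouped by statistic (all means, then all SEs, then all worst values).
columns = ["id", "diagnosis"] + [
    f"{feature}_{suffix}"
    for suffix in statistic_suffixes
    for feature in base_features
]

def column_number(name):
    """1-based position of a column, matching the numbering in the note."""
    return columns.index(name) + 1

print(len(columns) - 2)               # 30 feature columns
print(column_number("radius_mean"))   # 3
print(column_number("radius_se"))     # 13
print(column_number("radius_worst"))  # 23
```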

#### 2. Load The Data

In [3]:
#import library
import pandas as pd
import pandas_profiling
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
#import data
data = pd.read_csv('../input/breast-cancer.csv')

In [4]:
data

In [6]:
#Check the null data
data.isnull().sum()

#### 3. Data Types

In [None]:
#Check data types and memory usage
data.info()

## C. Data Exploration

In the data exploration step, we look at the distribution of each variable using a histogram. In a histogram, the horizontal axis shows the values of the feature and the vertical axis shows how often they occur. A correlation test is used to evaluate the relationship between two numerical variables. A correlation coefficient can only be computed between numerical variables, so a categorical variable must first be encoded as numbers.
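
As a small worked example of what the correlation coefficient measures, the sketch below computes Pearson's r by hand. The radii are made-up values, and the perimeter and area are derived from them as if each nucleus were a perfect circle, which is a simplifying assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

radius = [1.0, 2.0, 3.0, 4.0]
perimeter = [2 * math.pi * r for r in radius]  # exactly linear in the radius
area = [math.pi * r ** 2 for r in radius]      # nonlinear, but still increasing

r_perimeter = pearson_r(radius, perimeter)
r_area = pearson_r(radius, area)
print(round(r_perimeter, 3))  # 1.0
print(round(r_area, 3))       # 0.984
```

A coefficient close to 1 or -1 indicates a strong linear relationship, while a value close to 0 indicates little or none.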

Before running the correlation test, we need to convert the classification target in the diagnosis column to numerical values, because the correlation test can only process numerical data. This lets us include the target in the correlation test as well. Binarizing the target to 0 and 1 also makes it straightforward to calculate the F1 score when we evaluate our models.

In [None]:
# Change diagnosis to numerical data M --> 1; B--> 0
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
data['diagnosis'] = lb.fit_transform(data['diagnosis'])
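
As a quick sanity check of the encoding direction, the sketch below runs `LabelBinarizer` on a few toy labels (not rows from data.csv). The classes are sorted alphabetically, so B becomes 0 and M becomes 1.

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["M", "B", "B", "M"]  # toy diagnoses for illustration
binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(labels)

print(binarizer.classes_)  # classes in sorted order: B first, then M
print(encoded.shape)       # a single column with one row per label
print(encoded.ravel())     # 1 marks the second class (M)
```

Note that `fit_transform` returns a column of shape (n, 1) rather than a flat array; `ravel()` flattens it when a one-dimensional result is preferred.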

We also remove data that is clearly not required, such as the patient ID and the blank column. This also helps to speed up the correlation test.

In [None]:
# Delete unnecessary features
del data['Unnamed: 32']
del data['id']

In [None]:
# Check data shape after removal
data.shape

There are now 31 columns: 30 features plus the target.

Next, let's explore all variables using the Pandas Profiling Report. From the histograms below, we can see that the distributions are roughly normal.

In [None]:
pandas_profiling.ProfileReport(data)

## D. Data Preprocessing

#### 1. Data Selection

Now we need to select the data based on the question we want to address, which is the classification of the cancer. Even though all the available data seem relevant, we run a correlation test to decide which features must be included. The available data include independent variables (the inputs) and a dependent variable (the target). We want the inputs to be free of redundancy: when two features are highly correlated with each other, they carry nearly the same information, so we keep one of them and exclude the other.
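
The idea of flagging highly correlated inputs can be sketched as follows. The helper `highly_correlated_pairs`, the 0.9 threshold, and the synthetic data frame are all assumptions made for illustration; they are not taken from the profiling report.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, rho) for every pair whose |rho| exceeds the threshold."""
    corr = frame.corr()
    names = list(corr.columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = float(corr.loc[a, b])
            if abs(rho) > threshold:
                pairs.append((a, b, rho))
    return pairs

# Synthetic stand-in for the real data: the perimeter follows the radius
# exactly, while the texture is drawn independently of both.
rng = np.random.default_rng(0)
radius = rng.uniform(5.0, 25.0, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "texture": rng.normal(20.0, 4.0, size=200),
})

pairs = highly_correlated_pairs(toy)
for a, b, rho in pairs:
    print(f"{a} ~ {b}: rho = {rho:.3f}")
```

Only the radius and perimeter pair is reported, so one of the two can be dropped without losing much information.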

From the Pandas Profiling report, there are 14 warnings, 4 of which are due to zero values. Because the data are obtained from real images, we assume the zero values are genuine measurements rather than human error, so we do not exclude them.

The remaining 10 warnings concern correlation between features. We can split them into 2 groups for the correlation test.

**Group 1 (Features in Radius, Perimeter, and Area)**

Take a look at the Pandas Profiling report:
* area_mean is highly correlated with perimeter_mean (ρ = 0.98651) Rejected
* area_se is highly correlated with perimeter_se (ρ = 0.93766) Rejected
* area_worst is highly correlated with perimeter_worst (ρ = 0.97758) Rejected
* perimeter_mean is highly correlated with radius_mean (ρ = 0.99786) Rejected
* perimeter_se is highly correlated with radius_se (ρ = 0.97279) Rejected
* perimeter_worst is highly correlated with radius_worst (ρ = 0.99371) Rejected
* radius_worst is highly correlated with area_mean (ρ = 0.96275) Rejected


In [None]:
# Correlation test of group 1
group_1 = data.loc[:, ["radius_mean", "perimeter_mean","area_mean","radius_se","perimeter_se", "area_se",
"radius_worst","perimeter_worst", "area_worst"]].copy()
sns.heatmap(group_1.corr(),annot=True)

First, we choose one variable out of each set of three that are highly correlated:
* From "radius_mean", "perimeter_mean", and "area_mean", we choose "radius_mean"
* From "radius_se","perimeter_se", and  "area_se", we choose "radius_se"
* From "radius_worst","perimeter_worst", and  "area_worst", we choose "radius_worst"

Second, because the radius features "radius_mean" and "radius_worst" are also highly correlated, we need to choose one of them. Let's take "radius_mean".

At this step, we keep 2 variables **"radius_mean"** and **"radius_se"** and will exclude 7 other variables in group_1 ("perimeter_mean","area_mean","perimeter_se", "area_se",
"radius_worst","perimeter_worst", "area_worst")

**Group 2 (Features in concave points, texture, and concavity)**

Looking at the Pandas Profiling report, we can see:
* concave points_mean is highly correlated with concavity_mean (ρ = 0.92139) Rejected
* concave points_worst is highly correlated with concave points_mean (ρ = 0.91016) Rejected
* texture_worst is highly correlated with texture_mean (ρ = 0.91204) Rejected

In [None]:
# Correlation test of group 2
group_2 = data.loc[:, ["concave points_mean", "concavity_mean","texture_mean","concave points_se","concavity_se", "texture_se",
"concave points_worst","concavity_worst", "texture_worst"]].copy()
sns.heatmap(group_2.corr(),annot=True)

As in the previous test, we keep some features and remove the others that are redundant.
* From "concave points_mean", "concavity_mean", and "concave points_worst", we choose "concave points_mean"
* From "texture_mean" and "texture_worst", we choose "texture_mean"

At this step, we keep the variables **"concave points_mean"** and **"texture_mean"** and will exclude 3 other variables in group_2 ("concavity_mean", "concave points_worst", and "texture_worst")

#### 2. Preprocess Data

Now that we know which features to exclude, let's build the dataset that we want to work with.

In [None]:
# Drop the redundant features; axis must be passed by keyword in recent pandas versions
data = data.drop(['perimeter_mean','area_mean','perimeter_se','area_se','radius_worst','perimeter_worst', 'area_worst',
                 'concavity_mean','concave points_worst','texture_worst'], axis=1)

In [None]:
# See the correlation again after removing unwanted features
data.corr()

In [None]:
#Check the current features
data.columns

#### 3. Data Transformation

We need to check the boundaries (minimum and maximum values) of each feature.

In [None]:
# Check summary statistics
data.describe()

It is better to scale the numeric data because the features have different scales.

In [None]:
# Data transformation using Standard Scaler
from sklearn.preprocessing import StandardScaler
# All columns except the target (column 0 is 'diagnosis')
numeric_data = data.iloc[:, 1:]
sc = StandardScaler()
# The name `input` shadows the Python built-in, but it is kept because later cells use it
input = pd.DataFrame(sc.fit_transform(numeric_data), columns=numeric_data.columns)

In [None]:
# Preview the result of transformation
input
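
For reference, the transformation applied to each column is z = (x - mean) / std. The sketch below reproduces it by hand with NumPy on made-up values; the `standardize` helper is hypothetical. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

def standardize(column):
    """Rescale values to zero mean and unit variance (population std, ddof=0)."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

radius = np.array([10.0, 12.0, 14.0, 16.0, 18.0])  # made-up values
z = standardize(radius)
print(z.round(3))
print(z.mean().round(3), z.std().round(3))
```

One design choice worth noting: this notebook fits the scaler on the full dataset before splitting. A common alternative is to fit it on the training split only and reuse the fitted statistics on the test split.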

## E. Data Modelling

Let's prepare our input and output using a train-test split before we create the models.

In [None]:
# Train test split
from sklearn.model_selection import train_test_split
X = input
y = data['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
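
The split above is purely random. As an optional variation that is not used in this notebook, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses toy arrays rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 200 samples, 50 of class 1 and 150 of class 0
X_toy = np.arange(200).reshape(-1, 1)
y_toy = np.array([1] * 50 + [0] * 150)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

print(len(X_tr), len(X_te))      # sizes of the two splits
print(y_tr.mean(), y_te.mean())  # share of class 1 in each split
```

Both splits keep the 25% share of class 1, which can help when one class is much rarer than the other.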

We create 3 basic models and then optimize each model using a hyperparameter search technique. The models we use are:
1. Random Forest
2. K Nearest Neighbours
3. Support Vector Machine (SVM)
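
Before looking at the individual models, here is a minimal sketch of the idea behind a randomized hyperparameter search. The `random_search` helper, the toy grid, and the toy scoring function are hypothetical. Unlike `RandomizedSearchCV`, this sketch may draw the same combination twice and uses a made-up score instead of cross-validation.

```python
import random

def random_search(param_grid, score_fn, n_iter, seed=0):
    """Evaluate n_iter randomly drawn parameter combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per parameter, independently of the others
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid and scoring function, for illustration only
toy_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}

def toy_score(params):
    # Pretend the score peaks at n_neighbors=7 with distance weighting
    bonus = 0.01 if params["weights"] == "distance" else 0.0
    return 0.95 - 0.005 * abs(params["n_neighbors"] - 7) + bonus

best_params, best_score = random_search(toy_grid, toy_score, n_iter=15)
print(best_params, round(best_score, 3))
```

The search only sees the combinations it happens to draw, so the result is the best of the sample and not necessarily the best of the whole grid.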

#### 1. Random Forest

In [None]:
#Import random forest classifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Create random forest model 
rf_model = RandomForestClassifier(random_state=0)

In [None]:
# Apply the model
rf_model.fit(X_train, y_train)

In [None]:
# Predicted value
y_pred1 = rf_model.predict(X_test)

In [None]:
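#Create model evaluation function
from sklearn.metrics import f1_score

def evaluate(model, test_features, test_labels):
    # Predict on the held-out data and compute the F1 score of the positive class (malignant = 1)
    predictions = model.predict(test_features)
    F1 = f1_score(test_labels, predictions)
    print('Model Performance')
    print('F1 score = %.3f' % F1)
    # Return the computed score so it can be stored and compared later
    return F1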

In [None]:
#f1 score before optimization
f1_before_rf= evaluate(rf_model, X_test, y_test)

In [None]:
#confusion matrix before optimization
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred1)

In [None]:
# Random forest optimization parameters
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 778, stop = 784, num = 7)]
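# Number of features to consider at every split
# ('auto' is omitted: for classifiers it was an alias of 'sqrt', and newer scikit-learn versions reject it)
max_features = ['sqrt',5,6,7,8]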
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(start = 8, stop = 14, num = 7)]
# Minimum number of samples required to split a node
min_samples_split = [int(x) for x in np.linspace(start = 10, stop = 14, num = 5)]
# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(start = 1, stop = 6, num = 5)]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Function used to measure the quality of a split
criterion = ['gini', 'entropy']
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion':criterion}
print(random_grid)

In [None]:
#Create new model using the parameters
rf_random = RandomizedSearchCV(estimator = rf_model, param_distributions = random_grid, n_iter = 15,
                               cv = 5, verbose=2, random_state=0, n_jobs = -1)

In [None]:
#Apply the model
rf_random.fit(X_train, y_train)

In [None]:
#View the best parameters
rf_random.best_params_

In [None]:
# Predicted value
y_pred1_ = rf_random.best_estimator_.predict(X_test)

In [None]:
#f1 score after optimization
best_random = rf_random.best_estimator_
f1_after_rf= evaluate(best_random, X_test, y_test)

In [None]:
#confusion matrix after optimization
confusion_matrix(y_test, y_pred1_)

#### 2. KNN

In [None]:
#Import KNN classifier
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Create KNN model
kn_model = KNeighborsClassifier(n_neighbors=5)

In [None]:
# Apply the model
kn_model.fit(X_train, y_train)

In [None]:
# Predicted value
y_pred2 = kn_model.predict(X_test)

In [None]:
#f1 score before optimization
f1_before_kn= evaluate(kn_model, X_test, y_test)

In [None]:
#confusion matrix before optimization
confusion_matrix(y_test, y_pred2)

In [None]:
# KNN optimization parameters
n_neighbors = [5,6,7,8,9,10]
leaf_size = [1,2,3,5]
weights = ['uniform', 'distance']
algorithm = ['auto', 'ball_tree','kd_tree','brute']

random_grid_kn = {'n_neighbors':n_neighbors,
                  'leaf_size':leaf_size,
                  'weights':weights,
                  'algorithm':algorithm}
print(random_grid_kn)
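
To get a sense of how much of this grid a randomized search with `n_iter = 15` can cover, we can count the possible combinations. The `grid_size` helper is hypothetical; the grid is the one defined above.

```python
import math

def grid_size(param_grid):
    """Number of distinct combinations in a grid of value lists."""
    return math.prod(len(values) for values in param_grid.values())

knn_grid = {
    "n_neighbors": [5, 6, 7, 8, 9, 10],
    "leaf_size": [1, 2, 3, 5],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
}

total = grid_size(knn_grid)
n_iter = 15
print(total)                    # 6 * 4 * 2 * 4 = 192
print(f"{n_iter / total:.1%}")  # 7.8%
```

The search therefore tries fewer than one in ten of the possible combinations.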

In [None]:
#Create new model using the parameters
kn_random = RandomizedSearchCV(estimator = kn_model, param_distributions = random_grid_kn, n_iter = 15,
                           cv = 5, verbose=2, random_state=123, n_jobs = -1)

In [None]:
#Apply the model
kn_random.fit(X_train, y_train)

In [None]:
#View the best parameters
kn_random.best_params_

In [None]:
# Predicted value
y_pred2_ = kn_random.best_estimator_.predict(X_test)

In [None]:
#f1 score after optimization
best_random_kn = kn_random.best_estimator_
f1_after_kn= evaluate(best_random_kn, X_test, y_test)

In [None]:
#confusion matrix after optimization
confusion_matrix(y_test, y_pred2_)

#### 3. SVM

In [None]:
#Import SVM classifier
from sklearn.svm import SVC

In [None]:
# Create SVM model
svc_model = SVC(random_state=123)

In [None]:
# Apply the model
svc_model.fit(X_train, y_train)

In [None]:
# Predicted value
y_pred3 = svc_model.predict(X_test)

In [None]:
#f1 score before optimization
f1_before_svc= evaluate(svc_model, X_test, y_test)

In [None]:
#confusion matrix before optimization
confusion_matrix(y_test, y_pred3)

In [None]:
# SVM optimization parameters
C= [0.123,0.124, 0.125, 0.126, 0.127]
kernel = ['linear','rbf','poly']
gamma = [0, 1e-13, 1e-12, 1e-11]

random_grid_svm = {'C': C,
                   'kernel': kernel,
                   'gamma': gamma}
print(random_grid_svm)
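
For context on the `gamma` values in this grid, the rbf kernel is K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it for two hypothetical standardized samples. The comparison value 0.05 equals 1 / 20 features and is chosen only for illustration.

```python
import math

def rbf_kernel_value(x, y, gamma):
    """RBF kernel between two points: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two hypothetical samples that differ by 2 in each of 20 features,
# giving a squared distance of 20 * 2**2 = 80
x = [1.0] * 20
y = [-1.0] * 20

tiny_gamma_value = rbf_kernel_value(x, y, 1e-11)
larger_gamma_value = rbf_kernel_value(x, y, 0.05)
print(tiny_gamma_value)    # very close to 1
print(larger_gamma_value)  # exp(-4), about 0.018
```

With a `gamma` this small, the kernel value is almost 1 even for points that are far apart, so the rbf candidates in this grid may behave very similarly to one another.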

In [None]:
#Create new model using the parameters
svc_random = RandomizedSearchCV(estimator = svc_model, param_distributions = random_grid_svm, n_iter = 15,
                           cv = 5, verbose=2, random_state=123, n_jobs = -1)

In [None]:
#Apply the model
svc_random.fit(X_train, y_train)

In [None]:
#View the best parameters
svc_random.best_params_

In [None]:
# Predicted value
y_pred3_ = svc_random.best_estimator_.predict(X_test)

In [None]:
#f1 score after optimization
best_random_svc = svc_random.best_estimator_
f1_after_svc= evaluate(best_random_svc, X_test, y_test)

In [None]:
#confusion matrix after optimization
confusion_matrix(y_test, y_pred3_)

In the confusion matrix above, the number of false positives is 0 and the number of false negatives is 1.
As a reminder, here is what these terms mean:
* False positives (FP): We predicted yes, but they don't actually have the malignant cancer.
* False negatives (FN): We predicted no, but they actually do have the malignant cancer.
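
The counts defined above, and the F1 score built from them, can be computed by hand as in the sketch below. The labels are toy values, not the model output, and both helper functions are hypothetical. For 0/1 labels, `confusion_matrix` from scikit-learn arranges the same counts as [[TN, FP], [FN, TP]].

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true negatives, false positives, false negatives and true positives."""
    tn = fp = fn = tp = 0
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tn, fp, fn, tp

def f1_from_counts(fp, fn, tp):
    """F1 = 2 * TP / (2 * TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = malignant, 0 = benign
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
f1 = f1_from_counts(fp, fn, tp)
print(tn, fp, fn, tp)  # 3 1 1 3
print(f1)              # 0.75
```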

FP is the most important indicator. A nonzero FP count means that patients with a benign tumor were predicted to have a malignant one. This is very dangerous, because such a patient may undergo serious treatment, take high-dose drugs, or have major surgery that is not appropriate for them. Once a tumor is identified as malignant, there is little or no time to re-evaluate the patient, so the doctor may proceed with the wrong treatment and put the patient in danger.

In contrast, FN is the number of patients with a malignant tumor who were predicted as benign. There is time to re-assess these patients in order to provide better treatment, so this is not as severe as the opposite situation.

Because FP is zero and FN is very small (1), the predictive model using SVM does very well.

## F. Evaluation

Overall, the models perform well at predicting the class of the cancer, with an F1 score > 94% even without hyperparameter optimization:

* F1 score of **Random Forest** model = 94.7%
* F1 score of **KNN** model = 97.0%
* F1 score of **SVM** model = 99.3%

To increase the F1 score, we applied hyperparameter tuning using RandomizedSearchCV and obtained:
* F1 score of **Optimized Random Forest** model = 94.9%
* F1 score of **Optimized KNN** model = 97.7%
* F1 score of **Optimized SVM** model = 99.3%

We can conclude that SVM is the best model for classifying breast cancer, with the best F1 score of 99.3%.