# **Breast Cancer Wisconsin Diagnostic**

# **Introduction**

Detection of breast cancer is the preliminary phase in cancer diagnosis, so classifiers with high accuracy are always desirable. A highly accurate classifier has only a small chance of misclassifying a cancer patient. 

Breast cancer refers to cancer from a malignant tumor in the cells of the breast tissue. A malignant tumor is a group of cancer cells that can grow into surrounding tissues or spread to distant areas of the body. Breast cancer is uncontrolled multiplication of cells in breast tissue. A group of rapidly dividing cells may form a lump or architectural distortions. 

There are two main classifications of tumors. One is known as **benign** and the other as **malignant**. A benign tumor is a tumor that does not invade its surrounding tissue or spread around the body. A malignant tumor is a tumor that may invade its surrounding tissue or spread around the body.

**Benign** tumors are **non-malignant/non-cancerous tumors**. A benign tumor is usually localized and does not spread to other parts of the body.  **Malignant** tumors are **cancerous** growths. They are often resistant to treatment, may spread to other parts of the body, and sometimes recur after they have been removed.

There are two aspects of diagnosis of cancerous cells while doing testing.  A **false-positive** test occurs when test results appear to be abnormal, even though there is actually no cancer. A **false-negative** is when test results show no cancer when there really is cancer.

No test is perfect. A perfect test would give only **true positive** and **true negative** results, so a good screening test should at least have low rates of **false-positive** and **false-negative** results. **False-positive** results can create undue stress and anxiety and can lead to unnecessary further testing. **False-negative** results can delay treatment. I feel that a **false negative is more dangerous** in cancer detection, because the patient believes there is no cancer and therefore takes no precautions and seeks no medical treatment, while the cancerous cells keep growing and can progress to the next stage. When the patient starts to feel unwell again, they may go for another round of testing, but by then it may be too late to cure the cancer.

So, in this case we put more emphasis on keeping false negatives low than on the overall accuracy rate.

Let's explore a classification task with **Keras API for TF 2.0 with Early Stopping and Dropout Layer**.


# **Load the data**

## Import libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import math
import matplotlib
import tensorflow as tf

# Print versions of libraries
print(f"Numpy version : Numpy {np.__version__}")
print(f"Pandas version : Pandas {pd.__version__}")
print(f"Matplotlib version : Matplotlib {matplotlib.__version__}")
print(f"Seaborn version : Seaborn {sns.__version__}")
print(f"Tensorflow version : Tensorflow {tf.__version__}")

# Magic command to render plots inside the notebook
%matplotlib inline

# Set the seaborn style
sns.set(style='darkgrid', palette='Set2')

## Import dataset

In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv', encoding = 'latin-1')

In [None]:
df.head(10).T

# **Exploratory Data Analysis**

Once the data is read into Python, we need to explore, clean and filter it before processing it for machine learning. This involves adding or deleting a few columns or rows, joining other data, and handling qualitative variables like dates.

In [None]:
df.columns

**Attribute Information:**

* 1) ID number
* 2) Diagnosis (M = malignant, B = benign) 

Attribute 3-32:

Ten real-valued features are computed for each cell nucleus:

* a) radius (mean of distances from center to points on the perimeter)
* b) texture (standard deviation of gray-scale values)
* c) perimeter
* d) area
* e) smoothness (local variation in radius lengths)
* f) compactness (perimeter^2 / area - 1.0)
* g) concavity (severity of concave portions of the contour)
* h) concave points (number of concave portions of the contour)
* i) symmetry
* j) fractal dimension ("coastline approximation" - 1)

The **mean**, **standard error (se)** and **worst** or **largest** (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
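
The layout described above (ten measurements, three statistics each, starting at field 3) can be sketched in a few lines. This is only an illustration: the `name_statistic` spelling follows the column names used later in this notebook (for example `radius_mean`), but the exact spelling of every column in the CSV is an assumption, and `field_number` is a hypothetical helper.

```python
# Ten base measurements, in the order listed above (labels are illustrative).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three statistics per measurement, stored as three consecutive groups of ten.
STATISTICS = ["mean", "se", "worst"]

FIRST_FEATURE_FIELD = 3  # fields 1 and 2 are the ID number and the diagnosis


def field_number(feature, statistic):
    """Return the 1-based field number of a feature/statistic pair."""
    group = STATISTICS.index(statistic)
    position = BASE_FEATURES.index(feature)
    return FIRST_FEATURE_FIELD + group * len(BASE_FEATURES) + position


# Build the 30 column labels: all means first, then all SEs, then all worsts.
all_columns = [f"{name}_{stat}" for stat in STATISTICS for name in BASE_FEATURES]

print(len(all_columns))
print(field_number("radius", "mean"),
      field_number("radius", "se"),
      field_number("radius", "worst"))
```

The three printed field numbers reproduce the example in the text: Mean Radius, Radius SE and Worst Radius sit exactly ten fields apart.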

**diagnosis**: (Wisconsin Diagnostic Breast Cancer - WDBC)
* WDBC-Malignant
* WDBC-Benign

## Checking concise summary of dataset

It is also good practice to know the features and their corresponding data types, and to check whether they contain null values.

In [None]:
df.info()

**Observations**
* The dataset contains 569 samples (rows) and 33 columns.
* The data has float, integer, and object (string) values.
* `diagnosis` is a string with the values M and B (M = malignant, B = benign).
* Every real feature has 569 non-null values, so there are no missing values in the form of NaN or NA.
* There is a column named "Unnamed: 32" that contains only NaN values.
* All feature columns are float64; the exceptions are `id` (integer) and `diagnosis` (object).
* Memory usage: about 147 KB, which is small.

## Delete unwanted columns

The columns 'id' and 'Unnamed: 32' are not useful for data analysis, so let's remove them first.

In [None]:
df.drop(['id','Unnamed: 32'],axis=1, inplace=True)

In [None]:
df.head().T

## Generate descriptive statistics

Let's summarize the central tendency, dispersion and shape of the dataset's distribution.

In [None]:
df.describe().T

**Observations**
* area_mean, perimeter_se, area_se and area_worst are highly positively skewed.
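
To make "positively skewed" concrete, here is a minimal sketch of the moment-based skewness coefficient on made-up numbers. It is an illustration only: the values are not taken from the dataset, `sample_skewness` is a hypothetical helper, and pandas' own `DataFrame.skew()` uses a bias-corrected estimator, so its numbers will differ slightly.

```python
def sample_skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]            # made-up, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 20]     # made-up, one large value stretches the right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

A feature such as an area, where a few very large tumors pull the mean well above the median, behaves like the second list.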

## Malignant and Benign Distribution

In [None]:
df["diagnosis"].value_counts().plot(kind = 'pie',explode=[0, 0.1],figsize=(6, 6),autopct='%1.1f%%',shadow=True)
plt.title("Malignant and Benign Distribution",fontsize=20)
plt.legend(["Benign", "Malignant"])
plt.show()

In [None]:
print(df['diagnosis'].value_counts())
print('\n')
print(df['diagnosis'].value_counts(normalize=True))

**Observations**

This dataset contains about 37.3% cancerous (malignant) cells and about 62.7% non-cancerous (benign) cells.

## Histogram of Radius Mean for Benign and Malignant Tumors

In [None]:
plt.figure(figsize=(12,10))

# Malignant in red, benign in green (label and color must match the filter)
sns.distplot(df[df['diagnosis'] == 'M']["radius_mean"], color='r', label = "Malignant") 
sns.distplot(df[df['diagnosis'] == 'B']["radius_mean"], color='g', label = "Benign") 

plt.xlabel("Radius Mean Values")
plt.ylabel("Density")
plt.title("Histogram of Radius Mean for Benign and Malignant Tumors", fontsize=14)
plt.legend()

plt.show()

In [None]:
# most_frequent_benign_radius_mean
df[df["diagnosis"] == 'B']['radius_mean'].value_counts().idxmax()

In [None]:
# most_frequent_malignant_radius_mean
df[df["diagnosis"] == 'M']['radius_mean'].value_counts().idxmax()

**Observations**

* The graph shows that malignant tumors mostly have a larger radius mean than benign tumors.
* The benign distribution (green in the graph) is approximately bell-shaped, like a normal (Gaussian) distribution.
* The most frequent malignant radius mean is 15.46 and the most frequent benign radius mean is 11.06.

## Distribution of other features

In [None]:
features_mean=list(df.columns[1:11])
# split dataframe into two based on diagnosis
dfM=df[df['diagnosis'] == 'M']
dfB=df[df['diagnosis'] == 'B']

#Stack the data
plt.rcParams.update({'font.size': 8})
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(14,16))
axes = axes.ravel()

for idx,ax in enumerate(axes):
    # Use 50 equal-width bins spanning the full range of the feature
    binwidth= (max(df[features_mean[idx]]) - min(df[features_mean[idx]]))/50
    
    ax.hist([dfM[features_mean[idx]],dfB[features_mean[idx]]], 
            bins=np.arange(min(df[features_mean[idx]]), max(df[features_mean[idx]]) + binwidth, binwidth) , 
            alpha=0.5,
            stacked=True, 
            density = True, 
            label=['M','B'],
            color=['r','g'])
    
    ax.legend(loc='upper right')
    ax.set_title(features_mean[idx])
plt.tight_layout()
plt.show()

**Observations**
* Mean values of cell radius, perimeter, area, compactness, concavity and concave points can be used to classify the cancer. Larger values of these parameters tend to correlate with malignant tumors.

* Mean values of texture, smoothness, symmetry and fractal dimension do not show a particular preference for one diagnosis over the other. None of the histograms shows noticeably large outliers that would warrant further cleanup.

In [None]:
melted_data = pd.melt(df,id_vars = "diagnosis",value_vars = ['radius_mean', 'texture_mean'])

plt.figure(figsize = (14,8))
sns.boxplot(x = "variable", y = "value", hue="diagnosis",data= melted_data, fliersize=0)

# plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.);
# plt.legend(["Benign", "Malignant"])
plt.show()

**Observations**

* radius_mean and texture_mean are higher for malignant cells. Cancerous cells have higher values (about radius_mean = 17.5 and texture_mean = 22) than non-cancerous cells.

## Relationship between more than 2 distributions

In [None]:
# We can also look at the relationship between more than 2 distributions
# sns.set(style = "white")

sns.pairplot(df, vars=["radius_mean","area_mean","texture_mean",'smoothness_mean',"fractal_dimension_se"], hue='diagnosis')
plt.suptitle('Relationship between features');
plt.show()

**Observations**

* All features except area_mean have roughly bell-shaped curves, which suggests they are approximately normally distributed.

In [None]:
# jointplot creates its own figure, so size it with `height` instead of plt.figure
sns.jointplot(x='radius_mean', y='area_mean', data=df, kind="reg", height=10)
plt.show()

## Correlation Among Explanatory Variables

Having **too many features** in a model is not always a good thing, because it can cause overfitting and worse results when we predict values for a new dataset. Thus, **if a feature does not improve your model much, leaving it out may be the better choice.**

Another important consideration is **correlation. If two features are very highly correlated, keeping both is usually not a good idea, because it can contribute to overfitting.** However, this does not mean that you must always remove one of the highly correlated features. 
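
As a small illustration of why size-related features end up highly correlated, the sketch below computes the Pearson coefficient by hand for made-up radii and the perimeter and area of perfect circles with those radii. The numbers and the `pearson` helper are assumptions for illustration; real nuclei are not perfect circles, so the dataset's correlations will differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


radii = [6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0, 27.0]   # made-up values
perimeters = [2 * math.pi * r for r in radii]            # linear in the radius
areas = [math.pi * r * r for r in radii]                 # quadratic in the radius

r_perimeter = pearson(radii, perimeters)
r_area = pearson(radii, areas)
print(round(r_perimeter, 3), round(r_area, 3))
```

Even the quadratic relationship gives a coefficient close to 1 over this range, which is why radius, perimeter and area carry largely overlapping information.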

In [None]:
plt.figure(figsize=(18,18))
plt.title('Pearson Correlation Matrix')
# Generating correlation (diagnosis is still a string column here, so exclude it)
corr = df.drop(columns='diagnosis').corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr,mask = mask,linewidths=0.25,vmax=0.7,square=True,cmap="viridis",linecolor='w',annot=True,cbar_kws={"shrink": .7});
plt.show()

## Reset the index

In [None]:
df.reset_index(inplace = True , drop = True)

# **Splitting data into Training and Testing samples**

We don't use the full data to build the model. Some data is randomly selected and set aside to check how good the model is. This is known as testing data; the remaining data, on which the model is built, is called training data. Typically 70% of the data is used for training and the remaining 30% for testing; in this notebook 25% is held out for testing.
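
The idea of a random hold-out split can be sketched without any library. This is not scikit-learn's implementation: `split_indices` is a hypothetical helper, rounding the test share up is a choice made for this sketch, and the shuffled order will not match what `train_test_split` produces for the same seed.

```python
import math
import random


def split_indices(n_samples, test_size=0.25, seed=101):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = math.ceil(n_samples * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(569, test_size=0.25, seed=101)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, so the model is never evaluated on a sample it was trained on.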

In [None]:
df['diagnosis'].value_counts()

In [None]:
df['diagnosis'] = df['diagnosis'].map({'M': 1,'B': 0})

In [None]:
df['diagnosis'].value_counts()

## So 1 represents M (malignant) and 0 represents B (benign).

In [None]:
X = df.drop('diagnosis',axis=1).values
y = df['diagnosis'].values

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=101)

In [None]:
# Quick sanity check with the shapes of Training and testing datasets
print("X_train - ",X_train.shape)
print("y_train - ",y_train.shape)
print("X_test - ",X_test.shape)
print("y_test - ",y_test.shape)

# **Scale the Features**

* It is a good idea to scale the data so that a feature with less significance does not end up dominating the objective function simply because of its larger range. For example, an age column ranges from about 0 to 80, while a salary column ranges from thousands to hundreds of thousands, so salary would dominate the prediction even if it is not important.
* In addition, features measured in different units should be scaled so that each starts with equal weight. For example, age in years and sales in dollars must be brought to a common scale before being fed to the ML algorithm.
* This usually results in a better prediction model.
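
Min-max scaling itself is a one-line formula, `(x - min) / (max - min)`. The sketch below applies it to a single made-up column, learning the minimum and maximum from the training values only and reusing them for the test values, which mirrors the fit/transform pattern used in the cells that follow. The numbers and helper names are assumptions for illustration.

```python
def fit_min_max(column):
    """Learn the scaling parameters from the training column only."""
    return min(column), max(column)


def transform_min_max(column, lo, hi):
    """Map values so that lo -> 0 and hi -> 1 (constant columns map to 0)."""
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant column
    return [(value - lo) / span for value in column]


train_area = [200.0, 500.0, 800.0, 1400.0]   # made-up training values
test_area = [350.0, 1700.0]                  # made-up test values

lo, hi = fit_min_max(train_area)
train_scaled = transform_min_max(train_area, lo, hi)
test_scaled = transform_min_max(test_area, lo, hi)

print(train_scaled)
print(test_scaled)
```

Note that a test value larger than anything seen in training is mapped above 1. That is expected when the scaler is fitted on the training data alone.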


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(X_train)

In [None]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Creating the Model

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout

In [None]:
model = Sequential()

# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

model.add(Dense(units=30,activation='relu'))
model.add(Dense(units=15,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))

# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')
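
To get a feeling for how small this network is, the number of trainable parameters can be counted by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The sketch assumes 30 input features (the columns left after dropping `id`, `Unnamed: 32` and `diagnosis`); `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs, units):
    """Weights (n_inputs * units) plus one bias per unit."""
    return n_inputs * units + units


n_inputs = 30              # assumed number of input features
layer_units = [30, 15, 1]  # the three Dense layers defined above

per_layer = []
for units in layer_units:
    per_layer.append(dense_param_count(n_inputs, units))
    n_inputs = units       # this layer's output feeds the next layer

total_params = sum(per_layer)
print(per_layer, total_params)
```

With only a few hundred training samples, even a model of this size can memorize the training set if it is trained for long enough.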

# **Training the Model**

 

# Example One: Choosing too many epochs and overfitting!

In [None]:
# https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network
# https://datascience.stackexchange.com/questions/18414/are-there-any-rules-for-choosing-the-size-of-a-mini-batch

model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=1
          )

In [None]:
model_loss = pd.DataFrame(model.history.history)

In [None]:
# DataFrame.plot creates its own figure, so pass the size directly
model_loss.plot(figsize=(12,8))
plt.show()

# Example Two: Early Stopping

We obviously trained too much! Let's use early stopping to track the val_loss and stop training once it has stopped improving.

### Early stopping:
**Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. Early stopping is a method that allows you to specify an arbitrarily large number of training epochs and stop training once the model performance stops improving on a held-out validation dataset.**

> More at : https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
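
The patience rule behind early stopping can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it ignores options such as a minimum improvement threshold, the validation losses are made up, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best loss), 0-based."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience was exhausted.
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improving at first, then creeping upwards.
val_losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.42, 0.50]
stopped_at, best_at = early_stopping_epoch(val_losses, patience=3)
print(stopped_at, best_at)
```

A larger patience tolerates longer plateaus and noisy validation curves at the cost of a few extra epochs of training.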

In [None]:
model = Sequential()

model.add(Dense(units=30,activation='relu'))
model.add(Dense(units=15,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam')

In [None]:
from tensorflow.keras.callbacks import EarlyStopping

**Stop training when a monitored quantity has stopped improving.**

In [None]:
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)

In [None]:
model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=1,
          callbacks=[early_stop]
          )

In [None]:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

# Example Three: Adding in DropOut Layers

### Dropout Layers
**Dropout is a regularization technique for neural network models. In this technique, randomly selected neurons are ignored during training. They are “dropped-out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass.**

More at : https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras
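
The forward pass of dropout can be sketched on a plain list of activations. This is an illustration under assumptions: it uses the common "inverted dropout" form, in which surviving activations are scaled by `1 / (1 - rate)` during training so that the expected sum stays the same, and both the activations and the `dropout` helper are made up for the example.

```python
import random


def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` while training; pass through otherwise."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)  # rescale survivors to preserve the expected sum
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]


rng = random.Random(0)
activations = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]   # made-up layer outputs

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(train_out)
print(infer_out)
```

At prediction time nothing is dropped, so the network uses all of its neurons; the randomness only affects training.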

In [None]:
from tensorflow.keras.layers import Dropout

In [None]:
model = Sequential()

model.add(Dense(units=30,activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(units=15,activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(units=1,activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam')

In [None]:
model.fit(x=X_train, 
          y=y_train, 
          epochs=600,
          validation_data=(X_test, y_test), verbose=1,
          callbacks=[early_stop]
          )

In [None]:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

# **Model Evaluation**

In [None]:
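# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output instead
y_train_pred = (model.predict(X_train) > 0.5).astype("int32")
y_test_pred = (model.predict(X_test) > 0.5).astype("int32")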
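
The sigmoid output is a probability, and the class label depends on where the cut-off is placed. The sketch below uses made-up probabilities to show how a lower threshold flags more samples as malignant; `to_class_labels` is a hypothetical helper and the 0.3 threshold is an arbitrary example, not a recommendation.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Label a sample 1 (malignant) when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


probabilities = [0.03, 0.48, 0.52, 0.97, 0.35]   # made-up sigmoid outputs

default_labels = to_class_labels(probabilities)                  # cut-off at 0.5
cautious_labels = to_class_labels(probabilities, threshold=0.3)  # flags borderline cases too

print(default_labels)
print(cautious_labels)
```

Lowering the threshold trades more false positives for fewer false negatives, which fits the priority given to false negatives in this notebook.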

In [None]:
from sklearn import metrics

In [None]:
# https://en.wikipedia.org/wiki/Precision_and_recall
print(metrics.classification_report(y_test, y_test_pred))

In [None]:
y_test_pred = y_test_pred.flatten()

## Confusion Matrix

In [None]:
# Heatmap for Confusion Matrix
cnf_matrix = metrics.confusion_matrix(y_test,y_test_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, annot_kws={"size": 25}, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1, fontsize = 22)
plt.ylabel('Actual',fontsize = 18)
plt.xlabel('Predicted',fontsize = 18)
plt.show()

## __Accuracy , Precision and Recall__


### __Accuracy__ : The most used and classic classification metric. Suited for binary classification problems.

$$  \text{Accuracy} = \frac{( TP + TN ) }{ (TP + TN + FP + FN )}$$

Accuracy is the share of correct predictions among all predictions. It is most meaningful when the classes are balanced.

### __Precision__ : What proportion of predicted positives are truly positive? Used when we need to be sure about every positive prediction.

$$ \text{Precision} = \frac{( TP )}{( TP + FP )} $$

### __Sensitivity or Recall__ : What proportion of actual positives is correctly classified? The metric of choice when we want to capture as many positives as possible.

$$ \text{Recall} = \frac{(TP)}{( TP + FN )} $$

### F1 Score : Harmonic mean of Precision and Recall. It balances the precision and recall of your classifier.

$$ F1 = \frac{2 * (\text{ precision } * \text{ recall })}{(\text{ precision } + \text{ recall } )} $$
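
The four formulas above translate directly into code. In the sketch below, `tp=54` and `fn=1` correspond to the 54 out of 55 malignant test samples reported further down in this notebook, while the `fp` and `tn` counts are made up for illustration, so only the recall value reflects the notebook's actual result. `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# tp and fn follow the 54/55 figure; fp and tn are hypothetical.
scores = classification_metrics(tp=54, fp=2, tn=86, fn=1)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Recall depends only on TP and FN, so it is the metric that directly measures how many cancer cases are missed.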



As discussed in the introduction, a **false negative** (the test shows no cancer when there really is cancer) is more dangerous than a **false positive** in cancer detection, because it can delay treatment until it is too late.

So, in this case we put more emphasis on recall (sensitivity), that is, on finding as many of the actual cancer cases as possible, than on overall accuracy.

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

https://kennis-research.shinyapps.io/Bayes-App/


In [None]:
# Printing the Overall Accuracy of the model
print("Accuracy of the model : {0:0.3f}".format(metrics.accuracy_score(y_test, y_test_pred)))

In [None]:
print("Count of Actual values of Test data :")
print(pd.Series(y_test).value_counts())

print("\n")

print("Count of Predicted values of Test data :")
print(pd.Series(y_test_pred).value_counts())

### Real Accuracy 

In [None]:
# 54 of the 55 malignant test samples were predicted as malignant
54/55

In [None]:
cnf_matrix[1][1]/pd.Series(y_test).value_counts()[1]

### So 98.18% is our real accuracy: the recall for malignant cases, i.e. 54 of the 55 malignant test samples are detected.

<p style="font-weight:bold;color:#1E90FF;font-size:18px">I welcome comments, suggestions, corrections and of course votes also.</p>