# Breast Cancer Detection

In [None]:
#Import important libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load the Wisconsin breast cancer dataset
data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

data.head(10)

## Cleaning the data

In [None]:
# Print every column name except the first one (id)
data.columns[1:]

In [None]:
# Drop the stray 'Unnamed: 32' column
data.drop('Unnamed: 32', axis=1, inplace=True)

data.columns[1:]

## Visual Analysis

![](https://miro.medium.com/max/4000/0*0XRrnsr7h5hebu8r.png)


### As the image above shows, malignant cells are larger and therefore have a bigger radius, perimeter and area. Because of their irregular shape, malignant cells are also expected to be more concave, with many concave, finger-like projections.

## Encoding Categorical Values

In [None]:
#Using Label Encoders to replace values in a single column
from sklearn import preprocessing

labelEncoder = preprocessing.LabelEncoder()

In [None]:
data['diagnosis'] = labelEncoder.fit_transform(data['diagnosis'])

print(data['diagnosis'].head())
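
`LabelEncoder` assigns integer codes to the classes in sorted order, so here `B` (benign) becomes 0 and `M` (malignant) becomes 1. A minimal pure-Python sketch of that behaviour follows; `encode_labels` and `sample` are hypothetical names used for illustration and are not part of scikit-learn:

```python
def encode_labels(values):
    """Map each distinct label to an integer, following the sorted order of the labels."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Hypothetical sample of diagnosis values, for illustration only
sample = ['M', 'B', 'B', 'M', 'B']
codes, mapping = encode_labels(sample)
print(mapping)  # {'B': 0, 'M': 1}
print(codes)    # [1, 0, 0, 1, 0]
```

Knowing this mapping matters later: class 1 (malignant) is the "positive" class in the confusion matrix.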

## Plotting the initial data

In [None]:
# import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Pie chart representation
pie_labels = ['Benign', 'Malignant']

#Number of benign and malignant cases
pie_y = data['diagnosis'].value_counts()

pie_explode = [0, 0.1]

plt.figure(figsize=(10, 8))
plt.pie(pie_y, labels=pie_labels, shadow=True,  autopct='%1.1f%%', explode=pie_explode, textprops={'fontsize': 14})
plt.legend()
plt.title("Percent of Cases in the Data")
plt.show()

In [None]:
# Box plot of each mean feature (y axis) against the diagnosis (x axis)
%matplotlib inline
sns.set_theme(style="whitegrid")

x = data['diagnosis']

y = ['radius_mean', 'texture_mean', 'perimeter_mean',
     'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
     'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']

fig, axes = plt.subplots(len(y)//2, 2, figsize=(20, 35))
axes_flat = axes.flatten()

# Pair each feature with its own subplot
for axis, para in zip(axes_flat, y):
    axis.xaxis.label.set_size(20)
    axis.yaxis.label.set_size(20)
    axis.tick_params(labelsize=18)

    sns.boxplot(x=x, y=data[para], ax=axis)
    
plt.show()

### Observations from Box plots

The box plots show how each individual parameter varies between benign and malignant cases. The following inferences can be drawn from them:

- For each parameter (e.g. radius_mean), the more separated the two boxes are, the more significant a role the parameter is likely to play in deciding whether the cancer is benign or malignant, because a larger separation is a clearer sign of abnormal behaviour by the cells.
- By this criterion, the "Fractal dimension" of the cell will have little to no impact in determining the outcome.
- Similarly, the "Symmetry" of the cell is not very influential for the result.
- The gap between the boxes (and in turn between the distributions) for the "Smoothness" of the cell is not very significant. The upper edge of the benign box almost coincides with the median of the malignant one. This property should therefore not be given a lot of weight in the detection (although it cannot be neglected), since it could lead malignant cases of average or below-average smoothness to be classified as benign.
- Almost all the other properties of the cells show a clearer distinction between benign and malignant cases, indicating that they will probably have a stronger say in determining the result.

Important: These observations consider only the middle 50% of the data (i.e. the box part of each plot). They are visual judgements rather than calculated ones; a calculated judgement would be too involved for an initial look and cannot be read off a graph.
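
The visual idea of "how far apart the boxes are" can also be put into a rough number. The sketch below is one possible way to do so, not a method used in this notebook: it divides the gap between the two medians by the average box height (interquartile range). The function name `box_separation` and the sample values are made up for illustration.

```python
from statistics import median, quantiles

def box_separation(benign, malignant):
    """Gap between the two medians, in units of the average box height (IQR)."""
    def iqr(values):
        q1, _, q3 = quantiles(values, n=4)
        return q3 - q1
    mean_iqr = (iqr(benign) + iqr(malignant)) / 2
    return abs(median(malignant) - median(benign)) / mean_iqr

# Made-up numbers for illustration only, not taken from the dataset
well_separated = box_separation([10, 11, 12, 13, 14], [18, 19, 20, 21, 22])
overlapping = box_separation([10, 11, 12, 13, 14], [10.5, 11.5, 12.5, 13.5, 14.5])
print(well_separated > overlapping)  # True
```

With the notebook's DataFrame, the two lists would be the values of one column for `diagnosis == 0` and for `diagnosis == 1`, taken before the `diagnosis` column is dropped.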

In [None]:
# Separate the target (diagnosis) from the input features; note that the 'id' column stays in data
y = data['diagnosis']
data.drop(labels=['diagnosis'], axis=1, inplace=True)

data.head()
print(y.head())

## More Preprocessing and Scaling

In [None]:
#Splitting the dataset 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=0)

print(X_train.shape)

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
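
The scaler is fitted on the training split only and then reused on the test split, so no information from the test set leaks into the preprocessing. A minimal pure-Python sketch of the same idea for a single feature follows; `fit_scaler` and `apply_scaler` are hypothetical helpers, and the values are made up:

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training data only."""
    return mean(train_column), pstdev(train_column)

def apply_scaler(column, mu, sigma):
    """Standardise values with previously learned statistics."""
    return [(v - mu) / sigma for v in column]

# Made-up values for one feature, for illustration only
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mu, sigma = fit_scaler(train)                 # statistics come from the training split
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)   # the test split reuses the training statistics

print(test_scaled[0])  # 0.0, because 5.0 equals the training mean
```

After scaling, the training column has mean 0 and standard deviation 1; the test column generally does not, which is expected.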

In [None]:
# Build a DataFrame of the scaled training features
all_cols = data.columns

scaled_data = pd.DataFrame(X_train, columns=all_cols)

scaled_data.head()

## Helper Function For Plotting

In [None]:
from sklearn.metrics import confusion_matrix

#Using GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

#Helps to plot confusion matrix of different Models
def plot_confusion(prediction):
    confusion_labels = np.array([['True Neg.', 'False Pos.'], ['False Neg.', 'True Pos.']])
    
    cm = confusion_matrix(y_test, prediction)

    df_cm = pd.DataFrame(cm, range(2), range(2))

    plt.figure(figsize=(8,8))

    sns.set(font_scale=1.4) # for label size

    labels = (np.asarray(["{0}\n\n{1}".format(string, value)
                          for string, value in zip(confusion_labels.flatten(),
                                                   cm.flatten())])
             ).reshape(2,2)

    ax = sns.heatmap(df_cm, annot=labels, annot_kws={"size": 16}, fmt='', cbar=False) # font size
    
    
    # Confusion Matrix Labels
    ax.set_xticklabels(['Negative', 'Positive'])
    ax.set_yticklabels(['Negative', 'Positive'])

    plt.show()
    
    #Print Important Medical terms
    Specificity = (cm[0][0]) / (cm[0][0] + cm[0][1])
    print("\nSpecificity is: {0:.2f}%".format(Specificity*100))

    Sensitivity = (cm[1][1]) / (cm[1][1] + cm[1][0])
    print("\nSensitivity is: {0:.2f}%".format(Sensitivity*100))

![](https://myvetzone.com/wp-content/uploads/2017/11/SeSp-Sen-and-Spec-VetZone.jpg)

### Thus we can define the two parameters in detection as:
### Sensitivity: The ability of a test to correctly identify people with a disease.
### Specificity: The ability of a test to correctly identify people without the disease.
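
As a worked example of these two definitions, the sketch below computes both values from the four cells of a confusion matrix. The helper name `sensitivity_specificity` and the counts are illustrative assumptions; they are not read from this notebook's output.

```python
def sensitivity_specificity(tn, fp, fn, tp):
    """Return (sensitivity, specificity) from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of cases with the disease that were detected
    specificity = tn / (tn + fp)   # share of cases without the disease that were cleared
    return sensitivity, specificity

# Illustrative counts only
sens, spec = sensitivity_specificity(tn=107, fp=1, fn=4, tp=59)
print("Sensitivity: {0:.2f}%".format(sens * 100))  # Sensitivity: 93.65%
print("Specificity: {0:.2f}%".format(spec * 100))  # Specificity: 99.07%
```

In a screening setting, a false negative (a missed malignant case) is usually the costlier error, which is why sensitivity is reported alongside specificity rather than accuracy alone.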

## Model Selection

## Logistic Regression

In [None]:
# Use the Logistic Regression model for classification
from sklearn.linear_model import LogisticRegression

grid_values = {'penalty': ['l1','l2'], 'C': [0.001,0.01,0.1,1,10,100,1000]}
classifier_lr = GridSearchCV(
                cv=None,
                estimator=LogisticRegression(random_state = 0, penalty='l1', solver='liblinear'), 
                param_grid=grid_values)

classifier_lr.fit(X_train, y_train)
pred_lr = classifier_lr.predict(X_test)

#Printing the best hyperparameters
print('Best parameters for Logistic Regression: ', classifier_lr.best_params_)
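
`GridSearchCV` tries every combination in the grid and scores each one with cross-validation (5 folds when `cv=None`). The sketch below counts how many models that means for the Logistic Regression grid, using only the standard library; `lr_grid`, `candidates` and `n_folds` are illustrative names:

```python
from itertools import product

# Same values as the Logistic Regression grid
lr_grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Every combination of one value per hyperparameter is one candidate model
names = list(lr_grid)
candidates = [dict(zip(names, combo)) for combo in product(*lr_grid.values())]

n_folds = 5  # GridSearchCV's default when cv=None
print(len(candidates))            # 14
print(len(candidates) * n_folds)  # 70
```

The same count grows quickly for larger grids, so it is worth checking before adding more values to a search.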

In [None]:
#Prediction of Logistic Regression
plot_confusion(pred_lr)

## K Nearest Neighbours

In [None]:
# Use KNeighbors Model for classification
from sklearn.neighbors import KNeighborsClassifier

grid_values = {'n_neighbors': [3, 5, 9, 11],
               'weights': ['uniform', 'distance'],
               'metric': ['euclidean', 'manhattan']
              }

classifier_knn = GridSearchCV(
                 KNeighborsClassifier(),
                 param_grid=grid_values
                )

classifier_knn.fit(X_train, y_train)
pred_knn = classifier_knn.predict(X_test)

#Printing the best hyperparameters
print('Best parameters for KNN: ', classifier_knn.best_params_)

In [None]:
#Prediction of K nearest Neighbours
plot_confusion(pred_knn)

## Naive Bayes

In [None]:
#Using GaussianNB method to use Naive Bayes Algorithm
from sklearn.naive_bayes import GaussianNB

#GridSearch is skipped for now; GaussianNB is used with its default parameters
classifier_nb = GaussianNB()

classifier_nb.fit(X_train, y_train)
pred_nb = classifier_nb.predict(X_test)

In [None]:
#Prediction of Naive Bayes
plot_confusion(pred_nb)

## Decision Tree

In [None]:
# Using Decision Tree Model for Classification
from sklearn.tree import DecisionTreeClassifier

grid_values = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15]}

classifier_dt = GridSearchCV(
                 DecisionTreeClassifier(),
                 grid_values)

classifier_dt.fit(X_train, y_train)
pred_dt = classifier_dt.predict(X_test)

print('Best parameters for Decision Tree: ', classifier_dt.best_params_)

In [None]:
#Prediction of Decision Tree
plot_confusion(pred_dt)

## SVC

In [None]:
#Using Support Vector Machine for Classification
from sklearn.svm import SVC

grid_values = {'kernel': ['rbf', 'linear', 'poly'], 
               'C': [0.1, 1, 10, 100, 1000], 
               'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

classifier_svc = GridSearchCV(
                 SVC(),
                 grid_values)

classifier_svc.fit(X_train, y_train)
pred_svc = classifier_svc.predict(X_test)

print('Best parameters for SVC: ', classifier_svc.best_params_)

In [None]:
#Prediction of Support Vector
plot_confusion(pred_svc)

## Random Forest 

In [None]:
#Using Random Forest Model for Classification
from sklearn.ensemble import RandomForestClassifier

grid_values = {'bootstrap': [True], 
               'max_depth': [5, 10, None], 
               'max_features': ['sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
               'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15]}

classifier_rf = GridSearchCV(
                 RandomForestClassifier(),
                 grid_values)

classifier_rf.fit(X_train, y_train)
pred_rf = classifier_rf.predict(X_test)

print('Best parameters for Random Forest: ', classifier_rf.best_params_)

In [None]:
#Prediction of Random Forest
plot_confusion(pred_rf)

# Conclusion

Of the models compared above, Logistic Regression is the most accurate at predicting breast cancer on the dataset collected by Dr. William H. Wolberg, physician at the University of Wisconsin Hospital at Madison, Wisconsin, USA.

- The final model has a commendable Specificity of 99.07%, meaning that it correctly identifies about 99% of the test-set cases without breast cancer.
- The model also has a Sensitivity of 93.65%, meaning that it correctly identifies approximately 94% of the test-set cases with breast cancer.