

# Breast Cancer Wisconsin


**Breast cancer is the most common type of cancer among women and one of the leading causes of cancer mortality. Globally, approximately 2.1 million women are diagnosed with breast cancer every year. Early and accurate diagnosis increases the effectiveness of treatment, thereby improving survival rates.**


**Fine needle aspiration (FNA) is the most common method to diagnose breast cancer. A clinician examines a sample under a microscope and classifies it as benign or malignant. However, human error can lead to false negatives and false positives. Supervised classification algorithms can help make the diagnosis more reliable.**

**This notebook conducts exploratory data analysis, data cleaning and predictive modelling using the Breast Cancer Wisconsin Dataset. The objective is to accurately classify benign and malignant samples.**


**The visual characteristics of each digitized sample are described in terms of the size and shape of its cell nuclei. The ten characteristics listed below form the basis of the input variables.**


1. radius (mean of distances from center to points on the perimeter)
2. texture (standard deviation of gray-scale values)
3. perimeter
4. area
5. smoothness (local variation in radius lengths)
6. compactness (perimeter^2 / area - 1.0)
7. concavity (severity of concave portions of the contour)
8. concave points (number of concave portions of the contour)
9. symmetry
10. fractal dimension ("coastline approximation")
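
To make the compactness definition in the list concrete, here is a minimal sketch that applies the formula `perimeter^2 / area - 1.0` to two simple shapes. The function name `compactness` and the example shapes are illustrative assumptions; they are not part of the dataset or of any library.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0


# A circle minimises perimeter^2 / area (the ratio is 4*pi for any radius).
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# A 1 x 9 rectangle (perimeter 20, area 9) is far less compact.
rectangle = compactness(20.0, 9.0)

print(f"circle:    {circle:.3f}")
print(f"rectangle: {rectangle:.3f}")
```

The elongated rectangle scores much higher than the circle, which is why compactness is informative about irregular nucleus outlines.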

# 1.  Importing Libraries

In [None]:
# Standard library
from collections import Counter

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Resampling
from imblearn.over_sampling import SMOTE

# Model selection, metrics and preprocessing
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# Classifiers
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

In [None]:
# Import the dataset and drop the 'id' and 'Unnamed: 32' columns, as they are irrelevant.
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv').drop(columns=['id','Unnamed: 32'])

# Subsets for predictors and target variable.
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

#  2. Exploratory Data Analysis 




**This section employs univariate, bivariate and multivariate analysis using visualizations and summary statistics to detect any trends and patterns within the dataset.**

In [None]:
# Summary Statistics
df.describe()

In [None]:
# All independent variables are continuous (floating point).
df.info()

**All independent variables are numeric, giving 30 independent variables for the analysis. Ten real-valued features are computed for each cell nucleus in the digitized image: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. Each feature is then summarised per image in three ways, which divides the variables into three groups: the mean, the standard error and the "worst" value (the mean of the three largest values). Let's take a deeper look at these variables through univariate and bivariate analysis.**
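
The split of the 30 variables into three groups can be sketched by matching on the column-name suffix. The snippet below rebuilds the column names from the ten base features instead of reading the CSV, so it runs on its own; the exact spellings (`concave points` with a space, `fractal_dimension` with an underscore) are assumed to match the file header.

```python
# The ten base measurements, spelled as assumed in the CSV header.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry", "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# Stand-in for list(X.columns): every feature combined with every suffix.
columns = [f"{feature}_{suffix}" for suffix in suffixes for feature in base_features]

# Group the column names by their suffix.
groups = {
    suffix: [col for col in columns if col.endswith(f"_{suffix}")]
    for suffix in suffixes
}

for suffix, cols in groups.items():
    print(f"{suffix:<5} {len(cols)} columns, e.g. {cols[:3]}")
```

With the real data, replacing `columns` by `list(X.columns)` should give the same three groups of ten.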

# 2.1 Univariate Analysis

In [None]:
# Let's create a bar chart of the response variable
sns.set_style('ticks')
sns.set_palette('Set1')
sns.countplot(data=df, x="diagnosis", order=["M", "B"], palette='Set1')
plt.xlabel('Diagnosis')
plt.ylabel('Frequency')
plt.ylim([0, 500])
plt.show()


# Proportion of malignant vs. benign observations
proportions = df['diagnosis'].value_counts(normalize=True)
print(f"Percentage of Benign Observations: {proportions['B']:.1%}")
print(f"Percentage of Malignant Observations: {proportions['M']:.1%}")

**The response variable is imbalanced: benign observations outnumber malignant ones. Resampling the training data is one potential remedy.**

In [None]:
# Kernel density plots to see the distribution of the independent variables.
variables = list(df.columns)
sns.set(style="white")
plt.figure(figsize = (20, 60))
independent = variables[1:]  # every column except 'diagnosis'
for i, name in enumerate(independent):
    plt.subplot(15, 3, i + 1)
    # `fill` replaces the deprecated `shade` argument (seaborn >= 0.11)
    sns.kdeplot(df[name], fill = True, color = "olive")
plt.show()

**Several variables are positively skewed. It would be interesting to see whether these features differ between malignant and benign cancer cells. This is examined with box plots and bar charts in the bivariate analysis section that follows.**

# 2.2 Bivariate Analysis

**This section employs bivariate visualizations such as bar charts and box plots to detect any noticeable patterns in malignant and benign tissue samples.**

In [None]:
# Barcharts representing malignant and benign cancer cells against independent variables.
sns.set(style="white") 
plt.figure(figsize = (20 , 60))
for variable in range(30):
    plt.subplot(15, 3 , variable + 1)
    sns.barplot(x = df['diagnosis'], y =df[independent[variable]],  palette='Set1' )
plt.show()

In [None]:
# Boxplots representing malignant and benign cancer cells against independent variables.
sns.set(style="white") 
plt.figure(figsize = (20 , 60))
for variable in range(30):
    plt.subplot(15, 3 , variable + 1)
    sns.boxplot(x = df['diagnosis'], y =df[independent[variable]],  palette='Set1' )
plt.show()

**The visualizations above show a stark difference between the properties of malignant and benign tissue samples. In particular, the size-related features have higher measures of central tendency for malignant samples than for benign ones. Since the majority of the variables relate to the shape and size of the samples, some multicollinearity among the features is to be expected. Pearson correlation coefficients are computed next to detect it.**

In [None]:
# Let's produce a Pearson correlation heatmap (correlogram) to detect multicollinearity
correlated_var = X.corr()
plt.figure(figsize = (25, 25))
triangle = np.triu(correlated_var)  # mask the upper triangle so each pair is shown once
colormap = sns.color_palette("Greens")
sns.heatmap(correlated_var, annot = True, center = 0, linecolor = 'black', mask = triangle, fmt = '.2f', cmap = colormap)
plt.show()

**The heatmap indicates a strong presence of multicollinearity among some independent variables. Predictors related to area, perimeter and radius have correlation coefficients of 0.9 and above. As a simple form of feature selection, several of these variables are removed in the next section.**
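
Reading highly correlated pairs off a 30 x 30 heatmap is error-prone, so the same information can be listed programmatically. The sketch below is one possible approach, not the method used in this notebook: the helper name `correlated_pairs`, the 0.9 threshold and the synthetic demo data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle: each pair appears once, the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic stand-in for the predictors: perimeter and area are derived from the radius,
# while texture is drawn independently.
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
demo = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
    "texture": rng.normal(20, 4, size=200),
})

high = correlated_pairs(demo, threshold=0.9)
print(high)
```

On the real data, `correlated_pairs(X)` would have to be called while `X` is still a DataFrame, and the result could guide which columns to drop.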

# 3. Data Preprocessing

**In this section, several preprocessing steps are carried out: feature selection, label encoding, feature scaling (standardization) and resampling.**

In [None]:
# Removing highly correlated features
var = ['perimeter_mean',
       'radius_mean',
       'radius_worst',
       'texture_mean',
       'radius_se',
       'area_se',
       'concave points_mean']

X = X.drop(var, axis = 1)
X = np.array(X)

In [None]:
# Encode benign as 0 and malignant as 1
dict_map = {'M': 1, 'B': 0}

y = y.map(dict_map)

In [None]:
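# Splitting the dataset into the Training set and Test set at a ratio of 80% to 20%.
# Stratifying on y keeps the benign/malignant ratio the same in both sets;
# random_state makes the split reproducible.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True, stratify = y, random_state = 0)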

In [None]:
# Performing Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
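# Class counts in the training set before oversampling
counter = Counter(y_train)
print(counter)

# Oversample the minority class of the training set only, using SMOTE
# (Synthetic Minority Oversampling Technique); the test set is left untouched.
X_train, y_train = SMOTE(random_state = 0).fit_resample(X_train, y_train)
print(Counter(y_train))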
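
SMOTE creates synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step with NumPy. It is a simplified illustration and not imblearn's implementation; the function name `smote_like_samples` and the toy data are hypothetical.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Interpolate between minority samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)      # randomly chosen minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    lam = rng.random((n_new, 1))               # interpolation weight in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])


# Toy minority class with two features.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_samples(minority, n_new=6, k=2)
print(synthetic.round(3))
```

Every synthetic point lies on a line segment between two real minority samples, so the new samples stay inside the region already occupied by the minority class.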

# 4. Predictive Modelling

In [None]:
# Training the Logistic Regression model on the Training set
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Extract predictions
y_pred = classifier.predict(X_test)

# k-Fold Cross Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
lr = accuracies.mean()*100
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
print('\n', classification_report(y_test, y_pred))

In [None]:
# Training the Decision Tree Classification model on the Training set
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Extract predictions
y_pred = classifier.predict(X_test)

# k-Fold Cross Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
dt = accuracies.mean()*100
print("\nCV Average Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
print('\n', classification_report(y_test, y_pred))



In [None]:
# Training the K-Nearest Neighbour model on the Training set
classifier = KNeighborsClassifier(n_neighbors = 20, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

# Extract predictions
y_pred = classifier.predict(X_test)

# k-Fold Cross Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
knn = accuracies.mean()*100
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
print('\n', classification_report(y_test, y_pred))

In [None]:
# Training the Random Forest Classification model on the Training set
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)

# Training
classifier.fit(X_train, y_train)

# Extract predictions
y_pred = classifier.predict(X_test)

# k-Fold Cross Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
rf = accuracies.mean()*100
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
print('\n', classification_report(y_test, y_pred))

In [None]:
# Training the Naive Bayes model on the Training set
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Extract predictions
y_pred = classifier.predict(X_test)

# k-Fold Cross Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
nb = accuracies.mean()*100
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
print('\n', classification_report(y_test, y_pred))

In [None]:
# Training the SVM model on the Training set
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

# Extract predictions
y_pred = classifier.predict(X_test)

# k-Fold Cross Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
svc = accuracies.mean()*100
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
print('\n', classification_report(y_test, y_pred))

In [None]:
# Training XGBoost on the Training set
# (use_label_encoder is deprecated in recent XGBoost releases and is not needed for 0/1 labels)
classifier = XGBClassifier(eval_metric = 'logloss')

# Training
classifier.fit(X_train, y_train)
# Extract predictions
y_pred = classifier.predict(X_test)

# k-Fold Cross Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
xg = accuracies.mean()*100
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
print('\n', classification_report(y_test, y_pred))

In [None]:
# List the models' cross-validation accuracies in descending order.
Algorithm = pd.DataFrame({
    'Algorithm': ['Logistic Regression', 'Decision Tree', 'K-nearest Neighbour', 'Random Forest Classifier',
                  'Naive Bayes', 'Support Vector Classifier', 'XGBoost'],
    'Accuracy': [lr, dt, knn, rf, nb, svc, xg]})

Algorithm.sort_values(by = 'Accuracy', ascending = False)

**Logistic Regression appears to be the winner in terms of cross-validated accuracy in this run.** **:D**
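
The modelling cells above repeat the same fit and cross-validate steps for each classifier. As a hedged sketch of how that comparison could be folded into one loop, the snippet below uses scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) so that it runs on its own. It keeps all 30 features and skips SMOTE, so its numbers are not comparable with the table above; the model subset and the `results` dictionary are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset (assumption: used here instead of the Kaggle CSV).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=0),
    "K-nearest Neighbour": KNeighborsClassifier(n_neighbors=20),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    # Scaling sits inside the pipeline, so it is re-fitted on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=10)
    results[name] = (scores.mean() * 100, scores.std() * 100)

# Print the models from highest to lowest mean accuracy.
for name, (mean, std) in sorted(results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name:<22} {mean:6.2f} % (+/- {std:.2f})")
```

Placing the scaler inside the pipeline means each cross-validation fold is standardized using only its own training portion, which avoids leaking information from the validation fold.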

# CONCLUSION

**It is imperative for health clinics to adopt novel strategies that aid the early classification and diagnosis of breast cancer. A core objective of a practitioner is to diagnose cancer patients accurately while minimizing false positives and false negatives. This analysis suggests that integrating machine learning into the field of oncology has the potential to improve the decision-making ability of healthcare clinicians.**
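
Because false negatives and false positives carry different clinical costs, accuracy alone does not tell the whole story. The sketch below shows how sensitivity and specificity follow from the four confusion-matrix counts. The function name `diagnostic_rates` and the counts are hypothetical and are not results from this notebook.

```python
def diagnostic_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Rates derived from confusion-matrix counts, with malignant as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),          # share of malignant samples detected
        "specificity": tn / (tn + fp),          # share of benign samples correctly cleared
        "false_negative_rate": fn / (tp + fn),  # malignant samples that were missed
        "false_positive_rate": fp / (tn + fp),  # benign samples flagged as malignant
    }


# Hypothetical counts for a 114-sample test set (42 malignant, 72 benign).
rates = diagnostic_rates(tp=40, fp=3, tn=69, fn=2)
for name, value in rates.items():
    print(f"{name:<20} {value:.3f}")
```

In a screening setting, the false negative rate is usually the figure to watch most closely, since a missed malignant sample delays treatment.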




# THANK YOU :D