# Predicting Breast Cancer - KNN Classification 
---
**Bashir Abubakar**

# Introduction

The contents of this notebook:
1. **The Data** - *Exploratory Data Analysis*
2. **The Variables** - *Feature Selection*
3. **The Model** - *Building a KNN Classifier*
4. **The Prediction** - *Making Predictions with the Model*

**Let's explore the Breast Cancer dataset and develop a KNN classifier that predicts whether a suspected cell sample is benign or malignant.**

# Data
---
*Extracted from the popular [UCI ML repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)*

### Attribute Information:

* **id** 
* **diagnosis**: M = malignant, B = benign

*Columns 3 to 32* 

Ten real-valued features are computed for each cell nucleus: 

* **radius**: mean of distances from center to points on the perimeter 
* **texture**: standard deviation of gray-scale values
* **perimeter** 
* **area** 
* **smoothness**: local variation in radius lengths 
* **compactness**: perimeter^2 / area - 1.0 
* **concavity**: severity of concave portions of the contour
* **concave points**: number of concave portions of the contour
* **symmetry** 
* **fractal dimension**: "coastline approximation" - 1

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.  For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
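
The naming scheme described above can be sketched in a few lines. This is only an illustration: the column spellings (for example `concave points_mean` with a space, and the `_se` suffix) are assumed to match the CSV header used later in this notebook.

```python
# Base feature names, assumed to be spelled as in the CSV header.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# The three statistics computed per feature.
STATISTICS = ["mean", "se", "worst"]


def feature_columns():
    """Return the 30 feature names: all means, then all SEs, then all worsts."""
    return [f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES]


columns = ["id", "diagnosis"] + feature_columns()

print(len(feature_columns()))                  # 30
# Fields 3, 13 and 23 (1-indexed) are positions 2, 12 and 22 here.
print(columns[2], columns[12], columns[22])    # radius_mean radius_se radius_worst
```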

---

In [None]:
print('Hello My name is Bashir Abubakar and welcome to this exploration!')

In [None]:
# import necessary libraries
# data cleaning and manipulation 
import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
!pip install chart_studio
!pip install cufflinks
from chart_studio.plotly import plot, iplot
from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
import plotly.graph_objs as go
import chart_studio.plotly as py
import plotly
import chart_studio
chart_studio.tools.set_credentials_file(username='bashman18', api_key='••••••••••')
init_notebook_mode(connected=True)
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.tools as tls
import itertools
import time

# machine learning
from sklearn.preprocessing import StandardScaler
import sklearn.linear_model as skl_lm
import sklearn.metrics as metrics
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn import neighbors
from sklearn.metrics import confusion_matrix, classification_report, precision_score
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.feature_selection import RFE
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve, train_test_split
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score

# initialize some package settings
sns.set(style="whitegrid", color_codes=True, font_scale=1.3)

%matplotlib inline

print('All modules imported')

In [None]:
# read in the data and check the first 10 rows
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df.head(10)

The last column, **Unnamed: 32**, appears to consist mostly of missing values. Let's check the other columns for missing values as well.

In [None]:
# general summary of the dataframe
df.info()

In [None]:
# check number of missing values in each column
null_feat = df.isnull().sum().to_frame('Count')
null_feat

It looks like our data does not contain any missing values, except for our suspect column **Unnamed: 32**, which is full of missing values. Let's go ahead and remove this column entirely. After that, let's check for the data type of each column.

In [None]:
# remove the 'Unnamed: 32' column
df = df.drop('Unnamed: 32', axis=1)
# encode the target as integers: malignant (M) = 1, benign (B) = 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
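
As a small, self-contained illustration of the cleaning steps above, the sketch below uses a made-up three-row frame (the values are placeholders, not real observations). It counts missing values per column, drops any column that is entirely empty, and encodes the target. Using `dropna(axis=1, how="all")` is an alternative to naming the column explicitly; it is shown here as an assumption about what is convenient, not as the approach this notebook relies on.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are placeholders.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Drop every column that contains nothing but missing values.
cleaned = toy.dropna(axis=1, how="all")

# Encode the target: malignant = 1, benign = 0.
cleaned = cleaned.assign(diagnosis=cleaned["diagnosis"].map({"M": 1, "B": 0}))

print(missing_per_column)
print(cleaned.dtypes)
```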

In [None]:
# check the data type of each column
df.dtypes

Our response variable, **diagnosis**, is categorical and has two classes, 'B' (Benign) and 'M' (Malignant), which we have just encoded as 0 and 1 respectively. All explanatory variables are numerical, so no further data type conversion is needed.

Let's now take a closer look at our response variable, since it is the main focus of our analysis. We begin by checking out the distribution of its classes.

In [None]:
# drop the id column as well and check the dataframe
df=df.drop("id",axis=1)
df.head()

In [None]:
# split the dataframe by class: M holds the malignant rows, B the benign rows
M = df[(df['diagnosis'] != 0)]
B = df[(df['diagnosis'] == 0)]

In [None]:
# check what the dataframe looks like
df.head()

In [None]:
trace = go.Bar(x = (len(M), len(B)), y = ['malignant', 'benign'], orientation = 'h', opacity = 0.8, marker=dict(
        color=[ 'gold', 'black'],
        line=dict(color='#000000',width=1.0)))

layout = dict(title =  'Count of diagnosis variable')
                    
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

trace = go.Pie(labels = ['benign','malignant'], values = df['diagnosis'].value_counts(), 
               textfont=dict(size=15), opacity = 0.8,
               marker=dict(colors=['black', 'gold'], 
                           line=dict(color='#000000', width=1.5)))

layout = dict(title =  'Distribution of diagnosis variable')
           
fig = dict(data = [trace], layout=layout)
py.iplot(fig)

In [None]:
benign, malignant = df['diagnosis'].value_counts()
print('Number of cells labeled Benign: ', benign)
print('Number of cells labeled Malignant : ', malignant)
print('')
print('% of cells labeled Benign', round(benign / len(df) * 100, 2), '%')
print('% of cells labeled Malignant', round(malignant / len(df) * 100, 2), '%')

Out of the 569 observations, 357 (or 62.7%) have been labeled benign, while the remaining 212 (or 37.3%) have been labeled malignant. Later, when we develop a predictive model and test it on unseen data, we should expect to see a similar proportion of labels.
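
One way to make sure a held-out set keeps a similar class proportion is a stratified split. The sketch below is a hedged illustration on placeholder data: the labels only reproduce the 357 / 212 class counts of this dataset, the single feature column is a dummy, and the `stratify` argument is an addition that the split used later in this notebook does not include.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 357 observations of the majority class (0) and 212 of the minority class (1).
y = np.array([0] * 357 + [1] * 212)
# Placeholder feature column; only the labels matter for this illustration.
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions (almost) identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

overall_share = y.mean()       # share of class 1 in the full data
test_share = y_test.mean()     # share of class 1 in the test set
print(len(y_test), round(overall_share, 3), round(test_share, 3))
```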

Although our dataset has 30 columns excluding the **id** and the **diagnosis** columns, they are all in fact very closely related since they all contain information on the same 10 key attributes but only differ in terms of their perspectives (i.e., the mean, standard errors, and the mean of the three largest values denoted as "worst"). 

In this sense, we could attempt to dig out some quick insights by analyzing the data in only one of the three perspectives. For instance, we could choose to check out the relationship between the 10 key attributes and the **diagnosis** variable by only choosing the "mean" columns.

Let's quickly scan for any interesting patterns between our "mean" columns and the response variable by plotting the distribution of each feature separately for malignant and benign cells, as shown below:

In [None]:
def plot_distribution(df_f, size_bin) :  
    tmp1 = M[df_f]
    tmp2 = B[df_f]
    hist_data = [tmp1, tmp2]
    
    group_labels = ['malignant', 'benign']
    colors = ['#FFD700', '#7EC0EE']

    fig = ff.create_distplot(hist_data, group_labels, colors = colors, show_hist = True, bin_size = size_bin, curve_type='kde')
    
    fig['layout'].update(title = df_f)

    py.iplot(fig, filename = 'Density plot')

In [None]:
plot_distribution('radius_mean', .5)

In [None]:
plot_distribution('texture_mean', .5)

In [None]:
plot_distribution('perimeter_mean', 5)

In [None]:
plot_distribution('area_mean', 10)
#plot_distribution('smoothness_mean', .5)
#plot_distribution('compactness_mean', .5)
#plot_distribution('concavity_mean', .5)
#plot_distribution('concave points_mean', .5)
#plot_distribution('symmetry_mean', .5)
#plot_distribution('fractal_dimension_mean', .5)

There are some interesting patterns visible. For instance, radius, perimeter and area all describe the size of the nucleus and are distributed in a very similar way across the two classes, which hints at the presence of multicollinearity between these variables. Another set of variables that may be collinear is concavity, concave points and compactness.

Next, we will compute the correlation matrix of the variables and display it as a heatmap. Let's find out if our hypothesis about multicollinearity has any statistical support.

In [None]:
#correlation
correlation = df.corr()
#tick labels
matrix_cols = correlation.columns.tolist()
#convert to array
corr_array  = np.array(correlation)

In [None]:
trace = go.Heatmap(z = corr_array,
                   x = matrix_cols,
                   y = matrix_cols,
                   xgap = 2,
                   ygap = 2,
                   colorscale='Viridis',
                   colorbar   = dict() ,
                  )
layout = go.Layout(dict(title = 'Correlation Matrix for variables',
                        autosize = False,
                        height  = 720,
                        width   = 800,
                        margin  = dict(r = 0 ,l = 210,
                                       t = 25,b = 210,
                                     ),
                        yaxis   = dict(tickfont = dict(size = 9)),
                        xaxis   = dict(tickfont = dict(size = 9)),
                       )
                  )
fig = go.Figure(data = [trace],layout = layout)
py.iplot(fig)

Looking at the matrix, we can immediately verify the presence of multicollinearity between some of our variables. For instance, the radius_mean column has a correlation of 1 and 0.99 with perimeter_mean and area_mean columns, respectively. This is probably because the three columns essentially contain the same information, which is the physical size of the observation (the cell). Therefore we should only pick one of the three columns when we go into further analysis.

Another place where multicollinearity is apparent is between the "mean" columns and the "worst" columns. For instance, the radius_mean column has a correlation of 0.97 with the radius_worst column. In fact, each of the 10 key attributes displays a very high correlation (from 0.7 up to 0.97) between its "mean" and "worst" columns. This is somewhat inevitable, because each "worst" value is computed from a subset of the same measurements as the corresponding "mean": it is the mean of the three largest values in an image. Therefore, it would be reasonable to discard the "worst" columns and focus only on the "mean" columns when training a model.
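
Reading pairs off a heatmap by eye gets tedious with 30 columns. The sketch below lists every pair of columns whose absolute correlation exceeds a threshold. It runs on synthetic data (a random radius, with perimeter and area derived from it geometrically, plus an unrelated texture column), and both the helper name `high_corr_pairs` and the 0.95 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.95):
    """Return (col_a, col_b, |corr|) for every column pair at or above threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value >= threshold:
                pairs.append((cols[i], cols[j], round(value, 3)))
    return pairs


# Synthetic stand-in data: perimeter and area are functions of the radius.
rng = np.random.default_rng(0)
radius = rng.uniform(6, 28, size=200)
synthetic = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius,
    "area": np.pi * radius ** 2,
    "texture": rng.normal(19, 4, size=200),  # independent of the size features
})

pairs = high_corr_pairs(synthetic)
print(pairs)
```

On the real data, the same helper could be called on `df.drop('diagnosis', axis=1)` to obtain a list of candidate columns to drop.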

### Positively correlated features

Let's check the correlation between a few pairs of features.

In [None]:
def plot_ft1_ft2(ft1, ft2) :  
    trace0 = go.Scatter(
        x = M[ft1],
        y = M[ft2],
        name = 'malignant',
        mode = 'markers', 
        marker = dict(color = '#FFD700',
            line = dict(
                width = 1)))

    trace1 = go.Scatter(
        x = B[ft1],
        y = B[ft2],
        name = 'benign',
        mode = 'markers',
        marker = dict(color = '#7EC0EE',
            line = dict(
                width = 1)))

    layout = dict(title = ft1 +" "+"vs"+" "+ ft2,
                  yaxis = dict(title = ft2,zeroline = False),
                  xaxis = dict(title = ft1, zeroline = False)
                 )

    plots = [trace0, trace1]

    fig = dict(data = plots, layout=layout)
    py.iplot(fig)

In [None]:
plot_ft1_ft2('perimeter_mean','radius_worst')
plot_ft1_ft2('area_mean','radius_worst')
plot_ft1_ft2('texture_mean','texture_worst')
plot_ft1_ft2('area_worst','radius_worst')

### Uncorrelated features

In [None]:
plot_ft1_ft2('smoothness_mean','texture_mean')
plot_ft1_ft2('radius_mean','fractal_dimension_worst')
plot_ft1_ft2('texture_mean','symmetry_mean')
plot_ft1_ft2('texture_mean','symmetry_se')

In [None]:
plot_ft1_ft2('area_mean','fractal_dimension_mean')
plot_ft1_ft2('radius_mean','fractal_dimension_mean')
plot_ft1_ft2('area_mean','smoothness_se')
plot_ft1_ft2('smoothness_se','perimeter_mean')

### The Model
---

It's finally time to develop our model! We will start by splitting our dataset into two parts: a training set used to fit the model, and a test set used to validate the predictions that the model makes. If we omit this step, the model will be trained and tested on the same data, so the measured error will underestimate the true error rate and any overfitting will go unnoticed. It is like writing an exam after taking a look at the questions and answers beforehand. We want to make sure that our model truly has predictive power and is able to accurately label unseen data. We will set the test size to 0.3; i.e., 70% of the data will be assigned to the training set, and the remaining 30% will be used as a test set. In order to obtain reproducible results, we will set the random state parameter to a value of 1.
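
The exam analogy can be made concrete with a short sketch. It uses synthetic data from `make_classification` (an assumption, chosen so the example runs on its own) and a 1-nearest-neighbour model, for which every training point is its own nearest neighbour. The training accuracy is therefore perfect by construction and says nothing about performance on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with some label noise (flip_y).
X, y = make_classification(
    n_samples=569, n_features=30, n_informative=10, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# With k=1 each training point is classified by itself.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)  # evaluated on data the model has seen
test_accuracy = model.score(X_test, y_test)     # evaluated on held-out data
print(train_accuracy, round(test_accuracy, 3))
```

The gap between the two numbers is the reason the held-out test set is needed.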

### KNN Classification on all Features

Let's check how the model performs when we make predictions using all features.

In [None]:
df.head()

In [None]:
# define the feature matrix X (all columns except the target)
X=df.drop('diagnosis',axis=1)
X.head()

In [None]:
y=df['diagnosis']
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 1)

Now that we have split our data into appropriate sets, let's write the code to be used for the KNN Classifier.

**Since KNN is a distance-based algorithm, we need to standardize the feature values with a standard scaler.**

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
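
To see why scaling matters for a distance-based method, consider the sketch below. The five points are made-up values on roughly the scales of an area-like and a smoothness-like feature. On the raw values the large-scale column dominates the Euclidean distance; after per-column standardization (subtract the mean, divide by the standard deviation, which is the same transformation `StandardScaler` applies) the nearest neighbour of the first point changes.

```python
import numpy as np


def nearest_index(X, query_row=0):
    """Index of the row closest (Euclidean) to X[query_row], excluding itself."""
    distances = np.linalg.norm(X - X[query_row], axis=1)
    distances[query_row] = np.inf  # a point is not its own neighbour here
    return int(np.argmin(distances))


# Columns: an area-like feature (hundreds) and a smoothness-like feature (~0.1).
X = np.array([
    [600.0, 0.10],   # query point
    [610.0, 0.16],   # similar area, very different smoothness
    [700.0, 0.10],   # different area, identical smoothness
    [1000.0, 0.05],
    [300.0, 0.15],
])

# Standardize each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_nearest = nearest_index(X)
scaled_nearest = nearest_index(X_scaled)
print(raw_nearest, scaled_nearest)  # 1 2
```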

In [None]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3) 

start_time=time.time()

clf.fit(X_train, y_train)

end_time=time.time()

print("---%s seconds ---" % (end_time - start_time))

print(clf.score(X_test, y_test))

**We have now developed our KNN classifier. Let's print the scores to determine its accuracy.**

In [None]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix,roc_auc_score
# Evaluate the model on the training set
y_train_pred =clf.predict(X_train)
y_train_prob =clf.predict_proba(X_train)[:,1]

print("Accuracy Score of train", accuracy_score(y_train,y_train_pred))
print("AUC of the train ", roc_auc_score(y_train,y_train_prob))
print(" confusion matrix \n" , confusion_matrix(y_train,y_train_pred))

In [None]:
# Evaluate the model on the test set
y_test_pred =clf.predict(X_test)
y_test_prob =clf.predict_proba(X_test)[:,1]

print("Accuracy Score of test", accuracy_score(y_test,y_test_pred))
print("AUC of the test ", roc_auc_score(y_test,y_test_prob))
print(" confusion matrix \n" , confusion_matrix(y_test,y_test_pred))
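
For reference, scikit-learn lays out a binary confusion matrix with the actual classes as rows and the predicted classes as columns, i.e. `[[TN, FP], [FN, TP]]`. The sketch below unpacks those four cells and derives accuracy, precision and recall from them. The eight labels are made-up values for illustration only and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = malignant, 0 = benign.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted malignant cases, how many are malignant
recall = tp / (tp + fn)      # of the malignant cases, how many were detected

print(tn, fp, fn, tp)                 # 3 1 1 3
print(accuracy, precision, recall)    # 0.75 0.75 0.75
```

In a screening context the false negatives (malignant cases predicted as benign) are usually the most costly cell, so recall deserves attention alongside accuracy.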

In [None]:
y_pred_proba =clf.predict_proba(X_test)[:,1]

In [None]:
roc = roc_auc_score(y_train,y_train_prob)
roc

**Check the confusion matrix of the training set to see how many samples were predicted correctly**

In [None]:
cm_2 = confusion_matrix(y_train,y_train_pred)
cm_2

In [None]:
confusion_mat = confusion_matrix(y_test,y_test_pred)

In [None]:
import plotly.figure_factory as ff
fig = ff.create_annotated_heatmap(cm_2)

# add title (cm_2 is the confusion matrix of the training set)
fig.update_layout(title_text='<i><b>Confusion matrix (training set)</b></i>')

# move the x-axis labels to the top of the heatmap
fig.update_xaxes(side="top")

# add colorbar
fig['data'][0]['showscale'] = True
fig.show()

In [None]:
y_test_prob

Determine the optimum value of K

In [None]:
#Find Optimum K value
scores = []
for each in range(1,15):
    KNNfind = KNeighborsClassifier(n_neighbors = each)
    KNNfind.fit(X_train,y_train)
    scores.append(KNNfind.score(X_test,y_test))
    
plt.plot(range(1,15),scores,color="black")
plt.xlabel("K Values")
plt.ylabel("Score(Accuracy)")
plt.show()
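
Choosing K by its score on the test set means the test set is no longer truly unseen. A common alternative is to select K by cross-validation on the training data only. The sketch below is one way to do that, under stated assumptions: it loads the copy of this dataset bundled with scikit-learn via `load_breast_cancer` (where the target encoding is reversed, 0 = malignant and 1 = benign), wraps scaling and KNN in a `Pipeline` so the scaler is refit inside each fold, and uses 5 folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the data; note that here 0 = malignant and 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Try K = 1..14, the same range as the loop above, with 5-fold cross-validation.
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(1, 15))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
print("best K:", best_k)
print("cross-validated accuracy:", round(search.best_score_, 3))
# The test set is touched only once, after K has been chosen.
print("test accuracy:", round(search.score(X_test, y_test), 3))
```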

In [None]:
y_test_pred

### Feature Importances
**Let's estimate which features are most relevant to our analysis by plotting each feature's correlation with the diagnosis in a bar chart, as seen below**

In [None]:
tst = df.corr()['diagnosis'].copy()
tst = tst.drop('diagnosis')
tst.sort_values(inplace=True)
tst.iplot(kind='bar',title='Feature Importances',xaxis_title="Features",
    yaxis_title="Correlation")
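
The ranking behind this bar chart can be reproduced without plotting. The sketch below runs on a synthetic frame with one informative and one pure-noise column (both hypothetical) and orders the features by the absolute value of their correlation with the target. Keep in mind that this is a univariate screening heuristic: KNN itself does not produce feature importances, so the ranking is only a rough guide.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
diagnosis = rng.integers(0, 2, size=n)

# Synthetic stand-in data: one column tracks the target, one is noise.
toy = pd.DataFrame({
    "diagnosis": diagnosis,
    "informative": 2.0 * diagnosis + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
})

# Pearson correlation of every feature with the binary target.
correlations = toy.drop(columns="diagnosis").corrwith(toy["diagnosis"])

# Rank by absolute value so strong negative correlations are not overlooked.
order = correlations.abs().sort_values(ascending=False).index
ranked = correlations.reindex(order)
print(ranked)
```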

### Model Performance

In [None]:
def model_performance(clf,X_train,X_test,
                                 y_train,y_test) :
    
    #model
    clf.fit(X_train, y_train)
    y_test_pred =clf.predict(X_test)
    y_test_prob =clf.predict_proba(X_test)[::,1]


    
    print (clf)
    print ("\n Classification report : \n",classification_report(y_test,y_test_pred))
    print ("Accuracy Score   : ",accuracy_score(y_test,y_test_pred))
    #confusion matrix
    conf_matrix = confusion_matrix(y_test,y_test_pred)
    #roc_auc_score, computed from the predicted probabilities of the malignant class
    model_roc_auc = roc_auc_score(y_test,y_test_prob) 
    print ("Area under curve : ",model_roc_auc)
    fpr, tpr, _ = metrics.roc_curve(y_test,  y_test_prob)
     
    #plot roc curve
    trace1 = go.Scatter(x = fpr,y =tpr ,
                        name = "Roc : " + str(model_roc_auc),
                        line = dict(color = ('rgb(22, 96, 167)'),width = 2),
                       )
    trace2 = go.Scatter(x = [0,1],y=[0,1],
                        line = dict(color = ('rgb(205, 12, 24)'),width = 2,
                        dash = 'dot'))
    
    #plot confusion matrix (rows = actual class, columns = predicted class)
    trace3 = go.Heatmap(z = conf_matrix ,x = ["Predicted benign","Predicted malignant"],
                        y = ["Actual benign","Actual malignant"],
                        showscale  = False,colorscale = "Blues",name = "matrix",
                        xaxis = "x2",yaxis = "y2"
                       )
    
    layout = go.Layout(dict(title="Model performance" ,
                            autosize = False,height = 500,width = 1000,
                            showlegend = False,
                            plot_bgcolor  = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(title = "false positive rate",
                                         gridcolor = 'rgb(255, 255, 255)',
                                         domain=[0, 0.6],
                                         ticklen=5,gridwidth=2),
                            yaxis = dict(title = "true positive rate",
                                         gridcolor = 'rgb(255, 255, 255)',
                                         zerolinewidth=1,
                                         ticklen=5,gridwidth=2),
                            margin = dict(b=200),
                            xaxis2=dict(domain=[0.7, 1],tickangle = 90,
                                        gridcolor = 'rgb(255, 255, 255)'),
                            yaxis2=dict(anchor='x2',gridcolor = 'rgb(255, 255, 255)')
                           )
                  )
    data = [trace1,trace2,trace3]
    fig = go.Figure(data=data,layout=layout)
    
    py.iplot(fig)

In [None]:
model_performance(clf,X_train,X_test,y_train,y_test)
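
A note on the area under the curve: the ROC curve is built by sweeping a threshold over scores, so the AUC is normally computed from predicted probabilities rather than from hard 0/1 predictions. The sketch below uses made-up values to show that the two inputs give different numbers. The probabilities are multiples of 1/3, which is what a 3-nearest-neighbour model can output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up example: 4 benign (0) and 4 malignant (1) cases.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Probabilities of the malignant class, as a k=3 model would produce them.
proba = np.array([0, 0, 1 / 3, 2 / 3, 1 / 3, 2 / 3, 1, 1])

# AUC from the full ranking given by the probabilities.
auc_from_proba = roc_auc_score(y_true, proba)

# AUC from hard labels: the ranking collapses to a single threshold at 0.5.
hard_labels = (proba >= 0.5).astype(int)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(auc_from_proba, auc_from_labels)  # 0.875 0.75
```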

The confusion matrix tells us that, out of the 171 test observations, we have 108+57 = 165 correct predictions and 6+0 = 6 incorrect predictions.

In [None]:
X_train

In [None]:
X_test

We have successfully developed a KNN classifier. The model can take unlabeled data and assign each observation either a predicted class (with `predict`) or a probability of malignancy between 0 and 1 (with `predict_proba`). The predictions in `y_test_pred` are encoded as 0 and 1, exactly like the labels in the test data, which is what allowed us to compare them directly above. To make them easier to read, we can decode them back to "M" and "B", denoting malignant and benign respectively. Recall that we encoded malignant as 1 and benign as 0. Therefore, we can apply a threshold value of 0.5, assigning all values closer to 1 a label of "M" and all values closer to 0 a label of "B".

In [None]:
y_test_pred[1:6]

In [None]:
y_test_pred = ["M" if x >= 0.5 else "B" for x in y_test_pred]

In [None]:
y_test_pred[1:6]

We can confirm that predictions of 1 have been labeled as "M", while predictions of 0 have been labeled as "B".

This is the end of our exploration.

<img src="https://content.linkedin.com/content/dam/brand/site/img/logo/logo-tm.png"/>

# Let's Connect on LinkedIn!
If anybody would like to discuss any other projects or just have a chat about data science topics, I'll be more than happy to connect with you on **LinkedIn:**
https://www.linkedin.com/in/bashir-abubakar-61935417b/