In this notebook, I will build a K-NN classification model to predict whether a breast mass is malignant or benign, based on features computed from a digitized image of a fine needle aspirate of the mass. The data is from the Breast Cancer Wisconsin data set. I will also use seaborn to visualize the data and help select the best features for my model.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# importing other libraries I will need
import matplotlib.pyplot as plt
import seaborn as sns
# Other:
import warnings
warnings.filterwarnings('ignore')
sns.set_style('darkgrid')

In [None]:
# importing the dataset
data = pd.read_csv("../input/data.csv")

**Exploratory Data Analysis**

First, let's get to know the data.

In [None]:
data.head()

I see two columns that would be useless in my further analysis: the id column is not relevant, and the last column is empty. I will drop them.

In [None]:
# Drop useless variables
data = data.drop(['Unnamed: 32','id'],axis = 1)
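As a small sketch of how the empty column could be confirmed before dropping it, the snippet below flags columns that contain only missing values. The toy frame and the names `toy`, `empty_cols`, and `cleaned` are made up for illustration; only the column names `id` and `Unnamed: 32` come from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the raw CSV
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'radius_mean': [14.1, 20.6, 12.5],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()
print(empty_cols)

# Drop all-empty columns, then the identifier column
cleaned = toy.dropna(axis=1, how='all').drop(columns=['id'])
print(cleaned.columns.tolist())
```

Dropping with `how='all'` only removes columns that are completely empty, so a column with a few missing entries would be kept.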

In [None]:
data.shape

My dataset now consists of 31 columns and 569 observations. Let's see if there is any missing data.

In [None]:
# checking for missing values
data.isna().any()

There are almost twice as many benign cases in the dataset as malignant ones:

In [None]:
f,ax = plt.subplots(figsize=(10,5))
sns.countplot(y = data['diagnosis'], palette = "husl", ax=ax)
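The class balance can also be read off as numbers rather than from the plot. This is a minimal sketch on a made-up series (`diagnosis` below is a hypothetical stand-in for `data['diagnosis']`, with only eight entries), so the printed counts are not the real ones.

```python
import pandas as pd

# Hypothetical stand-in for data['diagnosis']
diagnosis = pd.Series(['B'] * 5 + ['M'] * 3, name='diagnosis')

counts = diagnosis.value_counts()                # absolute count per class
shares = diagnosis.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares.round(3))
```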

Before analyzing further, I need to normalize the data in the feature columns so that features measured on very different scales can be shown side by side in plots.

In [None]:
features = data.iloc[:, 1:]

In [None]:
from sklearn.preprocessing import MinMaxScaler
x = features.values #returns a numpy array
min_max_scaler = MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
feat = pd.DataFrame(x_scaled, index = features.index, columns = features.columns)
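For reference, min-max scaling maps each column to the range [0, 1] via `(x - min) / (max - min)`. The sketch below checks that formula against `MinMaxScaler` on a tiny made-up matrix (`x_toy` and its values are hypothetical).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix: 3 samples, 2 features
x_toy = np.array([[1.0, 10.0],
                  [2.0, 30.0],
                  [4.0, 20.0]])

# Manual min-max scaling, computed column by column
col_min = x_toy.min(axis=0)
col_max = x_toy.max(axis=0)
manual = (x_toy - col_min) / (col_max - col_min)

# The same transformation from scikit-learn
scaled = MinMaxScaler().fit_transform(x_toy)

print(manual)
print(np.allclose(manual, scaled))
```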

In [None]:
feat.shape

In [None]:
diag = data.iloc[:,0]

Now that the features have been scaled, let's take a closer look at them.

In [None]:
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(feat.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

Looks like we have some highly correlated features. Highly correlated features are close to linearly dependent, so they carry almost the same information about the dependent variable. When two features are highly correlated, one of them can be dropped. I will identify such features and drop them.

In [None]:
# Create correlation matrix
corr_matrix = feat.corr().abs()

# Select upper triangle of correlation matrix
# (np.bool was removed in NumPy 1.24, so the built-in bool is used instead)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.9
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
to_drop

In [None]:
# Drop features 
feat = feat.drop(to_drop, axis=1)
feat.shape
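To make the upper-triangle trick easier to follow, here is a minimal sketch of the same idea on a made-up three-column frame. The helper name `find_correlated` and the toy values are my own for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd

def find_correlated(df, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only entries strictly above the diagonal so each pair is looked at once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of 'a', so perfectly correlated
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to 'a' and 'b'
})

to_drop_toy = find_correlated(toy)
print(to_drop_toy)
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one flagged, so the result depends on column order.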

Putting my dataframe back together: the diagnosis column first, the features deemed not too highly correlated next.

In [None]:
df = pd.concat([diag, feat], axis=1, sort=False)
df.head()

In [None]:
# rewriting the categorical values in the target column as numerical
y = data['diagnosis'].apply(lambda x: 1 if 'M' in x else 0)
y.head()

# putting the dataframe back together
df_encoded = pd.concat([y, feat], axis=1, sort=False)
df_encoded.head()
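An alternative way to encode the target, sketched here on a made-up series, is an explicit mapping with `Series.map`. One possible advantage is that an unexpected label becomes `NaN` instead of silently being encoded as 0. The names `diagnosis_toy` and `label_map` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for data['diagnosis']
diagnosis_toy = pd.Series(['M', 'B', 'B', 'M'], name='diagnosis')

label_map = {'M': 1, 'B': 0}  # malignant -> 1, benign -> 0
encoded = diagnosis_toy.map(label_map)
print(encoded.tolist())

# A label outside the mapping shows up as missing rather than as 0
unexpected = pd.Series(['M', 'X']).map(label_map)
print(unexpected.isna().tolist())
```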

Let's see how each predictor variable varies by diagnosis. For many of the predictor variables, average values are higher in the malignant group. There are also plenty of outliers, especially in the benign data.

In [None]:
import math

# feature_names avoids shadowing Python's built-in vars()
feature_names = df_encoded.drop('diagnosis', axis = 1).keys()
plot_cols = 5
plot_rows = math.ceil(len(feature_names)/plot_cols)

plt.figure(figsize = (5*plot_cols,5*plot_rows))

# one boxplot per feature, split by diagnosis
for idx, var in enumerate(feature_names):
    plt.subplot(plot_rows, plot_cols, idx+1)
    sns.boxplot(x = 'diagnosis', y = var, data = df_encoded)
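The visual impression from the boxplots can be backed up with per-group summary statistics. This is only a sketch on made-up numbers: the frame `toy` and its values are hypothetical, and only the column names mirror the real data.

```python
import pandas as pd

# Hypothetical scaled features with an encoded target (1 = malignant, 0 = benign)
toy = pd.DataFrame({
    'diagnosis':    [0, 0, 0, 1, 1, 1],
    'radius_mean':  [0.20, 0.25, 0.30, 0.50, 0.60, 0.70],
    'texture_mean': [0.30, 0.35, 0.40, 0.45, 0.40, 0.50],
})

# Mean and median of every feature within each diagnosis group
group_means = toy.groupby('diagnosis').mean()
group_medians = toy.groupby('diagnosis').median()

print(group_means)
print(group_medians)
```

Comparing the median with the mean in each group also gives a rough hint about how strongly outliers pull the averages.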

Let's see how the remaining, less correlated features relate to each other. Looks like diagnosis is most highly correlated with radius (radius_mean and radius_se), concavity (concavity_mean and concavity_worst), and compactness (compactness_mean and compactness_worst).

In [None]:
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(df_encoded.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)

My suspicion is that compactness and concavity are still too highly correlated in all three groups (mean, se, and worst) to include both in the model. Let's check if I am right.

In [None]:
df_select = df.iloc[:,0:8]
sns.pairplot(df_select, hue = 'diagnosis')

In [None]:
df_select = df.iloc[:,[0,8,9,10,11,12,13,14,15]]
sns.pairplot(df_select, hue = 'diagnosis')

In [None]:
df_select = df.iloc[:,[0,16,17,18,19,20]]
sns.pairplot(df_select, hue = 'diagnosis')

Looks like compactness and concavity are too highly correlated indeed. I will drop concavity from my choice of selected features.

In [None]:
# Drop all three concavity columns in one call
df_encoded = df_encoded.drop(['concavity_mean', 'concavity_se', 'concavity_worst'], axis = 1)

In [None]:
df_encoded.head()

In [None]:
df_encoded.shape

My final choice for the model is these 17 features.

**K-NN Classifier**

In [None]:
X = df_encoded.iloc[:,1:].values
y = df_encoded.iloc[:,0].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
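Since the two classes are not equally frequent, one option worth considering is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below uses made-up labels (`X_toy`, `y_toy` are hypothetical) and is not what the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 benign (0) and 40 malignant (1)
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 60/40 ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(y_tr.mean(), y_te.mean())
```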


In [None]:
# Fitting K-NN to the Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)

In [None]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [None]:
# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

In [None]:
# checking the accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
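Accuracy alone does not show which kind of mistake the model makes, and for a cancer classifier a missed malignant case is arguably the costlier one. As a sketch, recall and precision for the malignant class can be derived from the confusion matrix. The label arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = malignant, 0 = benign
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

recall = tp / (tp + fn)     # share of malignant cases that were caught
precision = tp / (tp + fp)  # share of malignant predictions that were right

print(recall, precision)
print(recall_score(y_true_toy, y_pred_toy), precision_score(y_true_toy, y_pred_toy))
```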

The model is 92% accurate on the test set. Because a single train/test split can be lucky or unlucky, I will use K-fold cross-validation to better evaluate my model's performance.

In [None]:
# Applying 10-fold cross-validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)

In [None]:
accuracies.mean()
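One thing to keep in mind: the features were min-max scaled on the full dataset before splitting, so every validation fold has already influenced the scaler. A minimal sketch of one way to avoid that is to put the scaler and the classifier into a pipeline, so the scaler is refit on the training part of each fold. It assumes the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) as a stand-in for the Kaggle CSV and uses all 30 features rather than my selected 17; the names `X_demo`, `y_demo`, and `pipe` are made up for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data; note that here 0 = malignant and 1 = benign,
# the opposite of the encoding used in this notebook
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaler and classifier are fitted together, separately for each fold
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X_demo, y_demo, cv=10)

print(scores.mean(), scores.std())
```

The numbers from this sketch are not directly comparable with the ones in this notebook, since the feature set and the data split differ.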

After performing cross-validation, the model seems to be 95% accurate. However, I picked the initial number of nearest neighbors (K) and the weights arbitrarily. Let's see if we can improve the model's performance further by finding the optimal hyperparameters through grid search.

In [None]:
# Applying grid search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
# specifying the parameters I want to find optimal values for
parameters = [{'n_neighbors': [3,5,8,10], 'weights':['uniform']}, 
              {'n_neighbors': [3,5,8,10], 'weights':['distance']}
             ]
grid_search = GridSearchCV(estimator = classifier,
                          param_grid = parameters,
                          scoring = 'accuracy',
                          cv = 10)
grid_search = grid_search.fit(X_train, y_train)

In [None]:
best_accuracy = grid_search.best_score_
best_accuracy

Looks like the best accuracy score I could get for my model is 95.16%, which is not much better than the already achieved 95.13%. Let's check which parameter choices give the highest accuracy.

In [None]:
best_parameters = grid_search.best_params_
best_parameters

The best parameters for our K-NN model turn out to be the ones selected initially: uniform weights and K=5.
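To see how close the other candidates came, the full grid search results can be laid out as a table. This sketch again assumes scikit-learn's bundled `load_breast_cancer` data (unscaled, all 30 features) as a stand-in, so its scores will differ from mine; `search`, `param_grid`, and `results` are illustrative names. A single dictionary is used for the grid, which covers the same eight combinations as the two-dictionary list above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

param_grid = {'n_neighbors': [3, 5, 8, 10],
              'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring='accuracy', cv=10)
search.fit(X_demo, y_demo)

# One row per parameter combination, best mean accuracy first
results = pd.DataFrame(search.cv_results_)[
    ['param_n_neighbors', 'param_weights', 'mean_test_score', 'std_test_score']
].sort_values('mean_test_score', ascending=False)

print(results.to_string(index=False))
```

Looking at `std_test_score` next to the mean helps judge whether a small gap between two settings is meaningful or within fold-to-fold noise.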

The original data included 33 columns, 30 of which were features. By selecting the most appropriate 17 features, I was able to build a classification model with a cross-validated accuracy of about 95%.