# Introduction

This is a breast cancer prediction project.

To answer the question 'How can breast cancer be predicted?', I use a KNN classifier to predict the diagnosis.

The data comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import operator   # Python built-in: standard operators as functions

# Read Data

In [None]:
# Read data
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv', index_col='id')
df.head()  # display the first 5 rows of the dataset

In [None]:
df.info() #information of the dataset

In [None]:
df.describe() #description of the dataset

In [None]:
df.select_dtypes('number').corr()  # correlation between the numeric variables (non-numeric columns such as 'diagnosis' are excluded)

# Data Preprocessing

The column 'Unnamed: 32' can be dropped because its values are missing.

In [None]:
# Drop the column
df = df.drop(['Unnamed: 32'], axis=1)
df.head()  # display the first 5 rows of the dataset

### Show the number of each diagnosis

In [None]:
sns.catplot(x='diagnosis', data=df, kind='count')
plt.show()

### Getting Features and Target

In [None]:
x = df.drop('diagnosis', axis=1)  # features: every column except the diagnosis
y = df['diagnosis']               # target: the diagnosis to predict

### Split Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=24)
print('x_train:', len(x_train))
print('x_test:', len(x_test))
print('y_train:', len(y_train))
print('y_test:', len(y_test))

# KNN Model

### Finding the appropriate K value of KNN model

The choice of K strongly affects a KNN model's predictions.
Therefore, I'll compute the error rate for each candidate K and make a line plot to find the best value.
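
To illustrate why K matters, here is a minimal pure-Python sketch of the KNN voting rule. It is not scikit-learn's implementation; the helper name `knn_predict`, the toy points, and the query are made up for illustration, with 'B' and 'M' standing for benign and malignant.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features per sample
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (2.0, 2.0)]
labels = ['B', 'B', 'B', 'M', 'M', 'M']
query = (1.7, 1.7)

for k in (1, 3, 5):
    print(k, knn_predict(points, labels, query, k))
```

With K = 1 the query takes the label of its single closest point ('M'), while K = 3 and K = 5 let the surrounding 'B' points outvote it. The same effect is what the error-rate curve measures on the real data.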

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
error_rate = []

# Fit one model per candidate K and record its error rate on the test set
for k in range(2, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    pred = knn.predict(x_test)
    error_rate.append(np.mean(pred != y_test))

plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), error_rate, color='red', marker='o', markerfacecolor='green', markersize=10)
plt.title('Error Rate vs. K')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()

As the plot shows, any K from 5 to 10 is an appropriate choice. I use K = 5.

In [None]:
K = 5
knn = KNeighborsClassifier(n_neighbors=K)
knn.fit(x_train, y_train)

pred_knn = knn.predict(x_test)
pred_knn

# Model Validation

I evaluate the model with a classification report, a confusion matrix, and the accuracy score.
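
All three of these are derived from the same four counts: true negatives, false positives, false negatives, and true positives. The sketch below shows the arithmetic on made-up labels (not this notebook's results), treating 'M' as the positive class. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive='M'):
    """Compute confusion-matrix counts and basic metrics for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        # Rows are true labels, columns are predicted labels
        'confusion_matrix': [[tn, fp], [fn, tp]],
        'accuracy': (tp + tn) / len(pairs),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up example: 5 benign and 5 malignant samples
y_true = ['B', 'B', 'B', 'B', 'B', 'M', 'M', 'M', 'M', 'M']
y_pred = ['B', 'B', 'B', 'B', 'M', 'M', 'M', 'M', 'B', 'B']

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy example accuracy is 0.7, but recall for 'M' is only 0.6, which is why the per-class numbers are worth reading alongside the single accuracy score.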

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Classification report
print(classification_report(y_test, pred_knn))

# Confusion matrix
plt.title('K Nearest Neighbors Confusion Matrix')
sns.heatmap(confusion_matrix(y_test, pred_knn), annot=True, cmap='Blues', cbar=False, annot_kws={'size': 24})
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Accuracy score
print('Accuracy Score:', accuracy_score(y_test, pred_knn))



**I get an accuracy score of 95.61% using KNeighborsClassifier.**