# Predicting breast cancer with a classification model - K-Nearest Neighbors
[Breast Cancer Wisconsin (Diagnostic) Data Set on Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)

## 1. Import data for analysis

In [None]:
import os
import pandas as pd
import numpy as np

os.chdir('/kaggle/input')
os.getcwd()

In [None]:
df=pd.read_csv('breast-cancer-wisconsin-data/data.csv')
#df=pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
#df.head()
#df.columns
#df.shape #569*33
df.info()  #no missing values, except the empty 'Unnamed: 32' column (dropped in 2.1)

**Dataset information:**
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

* Dataset Characteristics: Multivariate
* Attribute Characteristics: Real
* Associated Tasks: Classification
* Number of Instances: 569
* Number of Attributes: 33 
* Missing Values: No

**Attribute Information**
* id: ID number
* diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)

In [None]:
df.groupby('diagnosis').size() #Diagnosis class distribution: 357 benign, 212 malignant

**Feature names and meanings (4dp)**
* radius_mean: mean of distances from center to points on the perimeter
* texture_mean: standard deviation of gray-scale values
* perimeter_mean: mean size of the core tumor
* area_mean: area of the tumor
* smoothness_mean: mean of local variation in radius lengths
* compactness_mean: mean of perimeter^2 / area - 1.0
* concavity_mean: mean of severity of concave portions of the contour
* concave_points_mean: mean for number of concave portions of the contour
* symmetry_mean
* fractal_dimension_mean: mean for "coastline approximation" - 1
* radius_se: standard error for the mean of distances from center to points on the perimeter
* texture_se: standard error for standard deviation of gray-scale values
* perimeter_se
* area_se
* smoothness_se: standard error for local variation in radius lengths
* compactness_se: standard error for perimeter^2 / area - 1.0
* concavity_se: standard error for severity of concave portions of the contour
* concave_points_se: standard error for number of concave portions of the contour
* symmetry_se
* fractal_dimension_se: standard error for "coastline approximation" - 1
* radius_worst: "worst" or largest mean value for mean of distances from center to points on the perimeter
* texture_worst: "worst" or largest mean value for standard deviation of gray-scale values
* perimeter_worst
* area_worst
* smoothness_worst: "worst" or largest mean value for local variation in radius lengths
* compactness_worst: "worst" or largest mean value for perimeter^2 / area - 1.0
* concavity_worst: "worst" or largest mean value for severity of concave portions of the contour
* concave_points_worst: "worst" or largest mean value for number of concave portions of the contour
* symmetry_worst
* fractal_dimension_worst: "worst" or largest mean value for "coastline approximation" - 1

## 2. Data Cleaning & Wrangling (EDA)


#### 2.1 Drop unnecessary columns
Drop the "id" and "Unnamed: 32" columns, since they carry no information for diagnosing breast cancer.

In [None]:
df.drop(["Unnamed: 32","id"],axis=1,inplace=True)
df.head()

#### 2.2 Descriptive Analysis
**Check descriptive statistics for the features**

In [None]:
df.describe()

In [None]:
#plot two features, colored by diagnosis, to see whether the classes separate well enough for the KNN algorithm
M = df[df.diagnosis == "M"]
B = df[df.diagnosis == "B"]

import matplotlib.pyplot as plt
plt.title("Malignant vs Benign Tumor")
plt.xlabel("Radius Mean")
plt.ylabel("Texture Mean")
plt.scatter(M.radius_mean, M.texture_mean, color = "tomato", label = "Malignant", alpha = 0.3)
plt.scatter(B.radius_mean, B.texture_mean, color = "olivedrab", label = "Benign", alpha = 0.3)
plt.legend()
plt.show()

**Convert the diagnosis label from M and B to a dummy variable**
* M (Malignant) = 1
* B (Benign) = 0


In [None]:
df['diagnosis']=np.where(df['diagnosis']=='M',1,0)  

In [None]:
#df.info() #'diagnosis' has changed to int64
df.describe()

**Check relationship between features and outcome variable**
* **1. Scatter plot**
* According to the plots in the first row, M and B observations are clearly separated in terms of these features, suggesting they are good predictors that we should put into the model later on.

In [None]:
#df_plot=df[['radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean','compactness_mean','diagnosis']]

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
g=sns.PairGrid(df,hue='diagnosis')  
g.map_offdiag(plt.scatter)  
g.add_legend()
plt.show()  

* **2. Correlation coefficients**
* The first row suggests a moderate or strong relationship between the diagnosis label and many of the other features

In [None]:
#df.corr()
#plot correlation heatmap
plt.figure(figsize=(25, 12))
sns.heatmap(df.corr(), vmin = -1, vmax = 1, center = 0, cmap = 'coolwarm', annot = True)
plt.show()
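
The first row of the heatmap can also be read off as a sorted list, which is easier to scan than 31 annotated cells. The sketch below uses a small synthetic frame (the column names `feature_strong` and `feature_noise` are made up for illustration) rather than the real data, so the numbers it prints say nothing about the tumor features.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: a 0/1 label, one feature tied to it, one pure-noise feature
rng = np.random.default_rng(0)
diagnosis = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    'diagnosis': diagnosis,
    'feature_strong': diagnosis * 2.0 + rng.normal(0, 0.5, size=200),
    'feature_noise': rng.normal(0, 1.0, size=200),
})

# Take the 'diagnosis' column of the correlation matrix, drop its self-correlation,
# and sort so the features most correlated with the label come first
corr_with_label = demo.corr()['diagnosis'].drop('diagnosis').sort_values(ascending=False)
print(corr_with_label)
```

On the notebook's own data the same expression would be `df.corr()['diagnosis'].drop('diagnosis').sort_values(ascending=False)`, assuming `diagnosis` has already been converted to 0/1 as above.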

## 3. Build model - KNN model from Scikit-learn 

### Meaning of KNN Algorithm
* Classify a data point by looking at the labels of its 'K' nearest labeled data points
* Take the majority vote among those neighbors
* :) simple and often accurate; a larger K makes predictions less sensitive to individual outliers
* :( computationally heavy at prediction time, since distances to the training points must be computed
* Main hyperparameter: n_neighbors - 'K'
* Train the model (training dataset): .fit()
* Predict on new data (testing dataset): .predict()
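
To make the "nearest neighbors plus majority vote" idea concrete, here is a minimal from-scratch sketch. It is not how scikit-learn implements `KNeighborsClassifier` internally; the function name `knn_predict` and the toy points are made up for illustration, and plain Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_new, k=3):
    """Label each row of X_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in np.asarray(X_new, dtype=float):
        # Euclidean distance from x to every training point
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        # indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # majority vote among their labels
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy data: two small clusters labelled 0 (benign) and 1 (malignant)
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, [[1.1, 1.0], [5.1, 5.0]], k=3)
print(toy_pred)  # [0 1]
```

Every prediction loops over all training points, which is why the method gets expensive as the training set grows.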

In [None]:
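#split the data into features X and label Y 
#df.info()
#column 0 is 'diagnosis'; note that 1:30 selects 29 features and leaves out the last column
#(fractal_dimension_worst); df.iloc[:,1:] would include all 30 features
X=df.iloc[:,1:30]
Y=df.iloc[:,0]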

### 3.1 Manually select 3 records as test dataset 

In [None]:
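#select 3 records for prediction
#(they are also part of the data the model is fitted on below, so this is a sanity check rather than a real test)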
X_new=X.iloc[200:203]  
Y_new=Y.iloc[200:203]

#KNN 
from sklearn.neighbors import KNeighborsClassifier
knn1 = KNeighborsClassifier(n_neighbors=3)
knn1.fit(X,Y) 
Y_predict1=knn1.predict(X_new) 

print('Prediction Result:{}'.format(Y_predict1))
print('Actual Result:{}'.format(Y_new))

### 3.2 Train/Test Split and Performance Metrics 
* *train_test_split*: splits the data into a training set and a testing set
* Default performance metric in Scikit-learn for KNN classifiers (.score()): **accuracy** 
* **accuracy = number of correct predictions / total number of predictions**

In [None]:
from sklearn.model_selection import train_test_split 
X_train, X_test, Y_train, Y_test=train_test_split(X,Y,test_size=0.2,random_state=1)  

knn2=KNeighborsClassifier(n_neighbors=3)
knn2.fit(X_train,Y_train)

Y_predict2=knn2.predict(X_test)
print("Test set predictions:{}".format(Y_predict2))

In [None]:
#Evaluation: accuracy
knn2.score(X_test, Y_test) 
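
As a small worked example of the accuracy formula, the sketch below counts correct predictions by hand. The two label arrays are made up for illustration and are not output from the model above.

```python
import numpy as np

# Made-up actual and predicted labels (1 = malignant, 0 = benign)
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# accuracy = number of correct predictions / total number of predictions
n_correct = int((y_true == y_pred).sum())
accuracy = n_correct / len(y_true)
print(n_correct, accuracy)  # 6 0.75
```

Applied to the notebook's variables, `np.mean(Y_predict2 == Y_test)` should give the same number as `knn2.score(X_test, Y_test)`.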

### 3.3 K-Fold Cross-Validation
* Model performance depends on the way the data is split
* A single test set carved out with test_size may not be representative of the model's ability to generalize
* **Solution: Cross-validation** 
* cv = number of groups (folds) that the data sample is split into
* :) Gives a more reliable estimate of the model's performance
* :( more folds, more computationally expensive

In [None]:
from sklearn.model_selection import cross_val_score
knn=KNeighborsClassifier(n_neighbors=3)
cv_results=cross_val_score(knn, X, Y, cv=5)  
print(cv_results)

In [None]:
#Evaluation: the average accuracy over the 5 train/test folds
print("The average accuracy rate is:{}".format(np.mean(cv_results)))
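
To show what `cross_val_score` is doing with `cv=5`, the sketch below runs the folds by hand and compares the result with the library call. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the block runs without the Kaggle file), and it relies on the fact that for a classifier an integer `cv` is treated as unshuffled stratified folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for X and Y
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Manual 5-fold loop: each fold is held out once as the test set
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # .score() returns accuracy on the held-out fold
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# The same evaluation in one library call
library_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X_demo, y_demo, cv=5)

print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
print("mean accuracy:", np.mean(manual_scores))
```

Every observation is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.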

## 4. Hyperparameter Tuning: Find out the optimal k
* k-Nearest Neighbors: choosing optimal n_neighbors
* Hyperparameters cannot be learned by fitting the model
* **Solution1: GridSearchCV**
* **Solution2: RandomizedSearchCV**

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid={'n_neighbors':np.arange(1,50)}   
knn=KNeighborsClassifier()  
knn_best_k=GridSearchCV(knn, param_grid, cv=5) 

knn_best_k.fit(X,Y)

In [None]:
print("Best parameter:",knn_best_k.best_params_)

In [None]:
#the accuracy rate for the best k 
knn_best_k.best_score_
print("Best score:",knn_best_k.best_score_)
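
The second option, randomized search, is not run in this notebook. A minimal sketch of what it could look like follows. It assumes scikit-learn's bundled copy of the same Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the Kaggle CSV; that copy encodes the label the other way round (0 = malignant, 1 = benign), and `n_iter=10` is an arbitrary choice, so the selected k need not match the grid search result.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Bundled copy of the dataset (30 features, no id column)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Try only 10 randomly chosen values of k out of 1..49, instead of all 49
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={'n_neighbors': list(range(1, 50))},
    n_iter=10,
    cv=5,
    random_state=1,
)
search.fit(X_demo, y_demo)

print("Best parameter:", search.best_params_)
print("Best score:", search.best_score_)
```

With a single hyperparameter and 49 candidates the exhaustive grid is cheap, so randomized search mainly pays off when several hyperparameters are tuned together.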

## 5. Classification Performance Metrics - Another way to evaluate the model: Confusion Matrix
* Accuracy is not always a useful metric
* If 99% of tumors are malignant and 1% are benign, a classifier that predicts ALL tumors as malignant is 99% accurate!
* :( But horrible at actually classifying benign tumors
* :( Fails at its original purpose
* **Solution: Confusion Matrix**
* precision = TP / (TP + FP), recall = TP / (TP + FN)
* F1 score = 2 * (precision * recall) / (precision + recall)

In [None]:
#train/test split with confusion matrix
#rows are the actual classes, columns are the predicted classes (0 = benign, 1 = malignant)
from sklearn.metrics import confusion_matrix 
knn=KNeighborsClassifier(n_neighbors=14) #use the best k we computed above
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=1) 
knn.fit(X_train,Y_train)
Y_predict3=knn.predict(X_test)
print(confusion_matrix(Y_test,Y_predict3))

In [None]:
from sklearn.metrics import classification_report
knn=KNeighborsClassifier(n_neighbors=14)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=1) 
knn.fit(X_train,Y_train)
Y_predict3=knn.predict(X_test)
print(classification_report(Y_test,Y_predict3))
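
To connect the confusion matrix to the precision, recall and F1 values in the report, the sketch below derives them by hand. The label arrays are made up for illustration and are not the model's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up actual and predicted labels (1 = malignant, 0 = benign)
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0, 1, 0])

# For binary 0/1 labels, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted malignant cases, how many really are
recall = tp / (tp + fn)      # of the actual malignant cases, how many were found
f1 = 2 * (precision * recall) / (precision + recall)

print(tn, fp, fn, tp)         # 5 1 1 3
print(precision, recall, f1)  # 0.75 0.75 0.75
print(f1_score(y_true, y_pred))
```

In a diagnostic setting a false negative (a malignant tumor predicted as benign) is usually the costlier error, which makes recall for the malignant class worth checking alongside accuracy.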

**Overall, among the values of k from 1 to 49 tried in the grid search, n_neighbors = 14 gives our KNN model its highest cross-validated accuracy.**