<font color = 'green'>
<h1>Breast Cancer Wisconsin (Diagnostic)</h1>

# Introduction
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
<br>
<br>
<font color = 'blue'>
<b>Content: </b>

1. [Prepare Problems](#1)
    * [Load Libraries](#2)
    * [Load Dataset](#3)    
1. [Descriptive Analysis](#4)
1. [EDA](#5)
1. [Missing Values](#6)
1. [Data Visualization](#7)
    * [Count Plot](#8)
    * [Pie Chart](#9)
    * [Distribution Plot](#10)
   
1. [Outlier Detection](#11)
    * [Let's Visualize the Outliers via Bubble Chart](#12)
1. [Drop Outliers](#13)
1. [Create Train and Test Dataset](#14)
1. [Standardization](#15)
1. [KNN Model](#16)
    * [KNN Tuning](#17)
    * [Make Prediction After Tuning](#18)

1. [Principal Component Analysis (PCA)](#19)
    * [Visualization of the New Dataframe](#20)
    * [Classification After PCA](#21)
1. [Neighborhood Components Analysis (NCA)](#22)
    * [Visualization of the New Dataframe](#23)
    * [Classification After NCA](#24)
1. [Compare Accuracies](#25)

<font color = 'red'>
<a id = "1"></a><br>
<h2>Prepare Problems</h2>
<font color = 'blue'>
      Predict whether the cancer is benign or malignant

<a id = "2"></a><br>
## Load Libraries

In [None]:
# Load Libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.offline as pyo 
import plotly.graph_objs as go
import plotly.figure_factory as ff
from matplotlib.colors import ListedColormap
#
from sklearn.metrics import classification_report 
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score 
#
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score
from sklearn import metrics
#
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis, LocalOutlierFactor
from sklearn.decomposition import PCA
#
import warnings
warnings.filterwarnings("ignore")
#
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

<a id = "3"></a><br>
## Load Dataset

In [None]:
data = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")
data.head()

In [None]:
# Drop Unnecessary columns
data.drop(["Unnamed: 32","id"],axis=1,inplace=True)
data.head()

Attribute Information:

* 1) ID number
* 2) Diagnosis (M = malignant, B = benign)
* 3-32) Ten real-valued features are computed for each cell nucleus:

*  radius (mean of distances from center to points on the perimeter)
*  texture (standard deviation of gray-scale values)
*  perimeter
*  area
*  smoothness (local variation in radius lengths)
*  compactness (perimeter^2 / area - 1.0)
*  concavity (severity of concave portions of the contour)
*  concave points (number of concave portions of the contour)
*  symmetry
*  fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
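
The field numbering above can be reproduced with a short sketch. The column naming pattern (`radius_mean`, `radius_se`, `radius_worst`, with a space kept in `concave points`) is an assumption based on how the Kaggle CSV appears to label its columns; only `radius_mean` and `texture_mean` are confirmed by the code in this notebook.

```python
# The ten base measurements listed above (spelling of the names is an assumption).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three summary statistics per measurement, in file order.
suffixes = ["mean", "se", "worst"]

# All means first, then all standard errors, then all "worst" values.
feature_names = [f"{feature}_{suffix}" for suffix in suffixes for feature in base_features]

# Fields 1 and 2 are the ID and the diagnosis, so feature i (0-based) is field i + 3.
field_of = {name: i + 3 for i, name in enumerate(feature_names)}

print(len(feature_names))
print(field_of["radius_mean"], field_of["radius_se"], field_of["radius_worst"])
```

The three printed field numbers correspond to the Mean Radius, Radius SE and Worst Radius fields mentioned in the text.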

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

<a id = "4"></a><br>
## Descriptive Analysis

In [None]:
# data shape:
row, columns = data.shape
print("Data Row:", row)
print("Data Columns:", columns)
# column names:
print(data.columns.tolist())
# descriptions 
display(data.describe().T)
# class distribution (357 benign vs. 212 malignant, so moderately imbalanced)
print("Class distribution:",data.groupby('diagnosis').size())

<a id = "5"></a><br>
## EDA

In [None]:
# correlation (numeric columns only; "diagnosis" is still a string column here):
corr_matrix = data.corr(numeric_only=True)
sns.clustermap(corr_matrix,annot=True,fmt=".2f",figsize=(20,14))
plt.suptitle("Correlation Between Features")

<a id = "6"></a><br>
## Missing Values

In [None]:
data.info()
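
`data.info()` reports non-null counts per column. An explicit count of missing values is sometimes easier to scan. The sketch below uses a tiny stand-in frame with made-up values (the `toy` name and its contents are hypothetical), including an all-NaN column similar to the `Unnamed: 32` column dropped earlier.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values), not the real dataset.
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, np.nan, 21.25],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Columns that are entirely empty can be dropped in one call.
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

On the real frame, `data.isnull().sum()` should report zero for every column if the dataset description ("Missing attribute values: none") holds.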

<a id = "7"></a><br>
## Data Visualization

In [None]:
data_m = data[data.diagnosis == "M"]
data_b = data[data.diagnosis == "B"]

<a id = "8"></a><br>
## Count Plot

In [None]:
# x is written out explicitly so the labels always match the order of the counts in y
trace = [go.Bar(x=["M","B"], y=(len(data_m),len(data_b)),
               marker=dict(color=["blue","brown"]))]
               
layout = go.Layout(title="Count of M = malignant, B = benign")
fig = go.Figure(data=trace,layout=layout)   
pyo.iplot(fig)

<a id = "9"></a><br>
## Pie Chart

In [None]:
labels = ["M","B"]
values = [len(data_m),len(data_b)]
trace = [go.Pie(labels=labels, values=values,
               marker=dict(colors=["blue","brown"]))]
layout = go.Layout(title="Percentage of M = malignant, B = benign ")
fig = go.Figure(data=trace,layout=layout)
pyo.iplot(fig)

<a id = "10"></a><br>
## Distribution Plot

In [None]:
def dist_plot(data_feature): 
    hist_data = [data_m[data_feature], data_b[data_feature]]
    
    group_labels = ['malignant', 'benign']
    colors=["blue","brown"]
    
    fig = ff.create_distplot(hist_data, group_labels, colors = colors)
    fig['layout'].update(title = data_feature)
    return pyo.iplot(fig)

### You can make more plots via dist_plot()

In [None]:
dist_plot('radius_mean')
dist_plot('texture_mean')

<a id = "11"></a><br>
## Outlier Detection

### Outlier detection with Local Outlier Factor (LOF)

The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method that computes the local density deviation of a given data point with respect to its neighbors. It treats as outliers the samples that have a substantially lower density than their neighbors. Outlier detection is the default use case of this estimator in scikit-learn; in this mode LOF has no `predict`, `decision_function`, or `score_samples` methods. See the scikit-learn User Guide for details on the difference between outlier detection and novelty detection and on how to use LOF for novelty detection.

The number of neighbors considered (parameter `n_neighbors`) is typically set 1) greater than the minimum number of samples a cluster has to contain, so that other samples can be local outliers relative to this cluster, and 2) smaller than the maximum number of nearby samples that can potentially be local outliers. In practice, such information is generally not available, and taking `n_neighbors=20` appears to work well in general.

Source: https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_lof_outlier_detection_001.png)
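
To make the description above concrete, here is a minimal sketch on synthetic data rather than on the cancer features. The cluster, the planted point and the variable names are all made up for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(100, 2))  # dense cluster around the origin
planted = np.array([[8.0, 8.0]])                          # one far-away point
points = np.vstack([cluster, planted])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(points)        # 1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_   # more negative = more abnormal

print(labels[-1])              # label assigned to the planted point
print(int(np.argmin(scores)))  # index of the most abnormal sample
```

The planted point sits in a region of far lower density than its neighbors, so it receives the most negative score and the label -1.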

In [None]:
# Change object to integer:
data["diagnosis"] = [1 if item == "M" else 0  for item in data["diagnosis"]]

In [None]:
y = data["diagnosis"]
x = data.drop(["diagnosis"],axis=1)

In [None]:
columns = x.columns.tolist()

In [None]:
clf = LocalOutlierFactor()
y_pred = clf.fit_predict(x)

* `fit_predict` returns `is_inlier`, an array of shape (n_samples,).
* It contains -1 for anomalies/outliers and 1 for inliers.

In [None]:
y_pred[:10]

In [None]:
X_score = clf.negative_outlier_factor_
outlier_score = pd.DataFrame()
outlier_score["score"] = X_score

In [None]:
outlier_score.head()

<a id = "12"></a><br>
## Let's Visualize the Outliers via Bubble Chart

In [None]:
# Choose a threshold between the min and max of "outlier_score"; samples scoring below it are treated as outliers
threshold = -2
filtre = outlier_score["score"] < threshold
outlier_index = outlier_score[filtre].index.tolist()

In [None]:
# Bubble radius per sample: 0 for the least abnormal score, 1 for the most abnormal
radius = (X_score.max()-X_score)/(X_score.max()-X_score.min())
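
The thresholding and the radius formula can be checked on a handful of made-up scores. The values below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

# Hypothetical LOF scores: values near -1 are typical, very negative ones are abnormal.
scores = np.array([-1.0, -1.2, -3.5, -1.1, -2.4])

# Same rule as above: anything below the threshold counts as an outlier.
threshold = -2
outlier_index = np.where(scores < threshold)[0].tolist()

# Flipped min-max scaling: the most negative score maps to 1, the least negative to 0.
radius = (scores.max() - scores) / (scores.max() - scores.min())

print(outlier_index)
print(radius.round(2))
```

Because the radius grows with abnormality, the largest bubbles in the chart are the samples LOF considers most isolated.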

In [None]:
trace0 = go.Scatter(x=x.iloc[outlier_index,0], y=x.iloc[outlier_index,1],
                   mode="markers",
                   marker=dict(size=10,color="brown"),
                   name="outliers"
                   )

trace1 = go.Scatter(x=x.iloc[:,0], y=x.iloc[:,1],
                   mode="markers",
                   marker=dict(size=50*radius,color="gold"),
                   name="real points"
                   )
 
layout = go.Layout(title="Outliers (Depends on Threshold Value)",hovermode="closest")
fig = go.Figure(data=[trace0,trace1],layout=layout)
pyo.iplot(fig)

<a id = "13"></a><br>
## Drop Outliers

In [None]:
x = x.drop(outlier_index)
y = y.drop(outlier_index)

<a id = "14"></a><br>
## Create Train and Test Dataset

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
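
With classes of unequal size, passing `stratify` to `train_test_split` keeps the class proportions similar in both parts. This is an optional variation, not what the split above does. The sketch uses placeholder features and labels built from the class counts quoted earlier (357 benign, 212 malignant).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the quoted class counts: 357 benign (0), 212 malignant (1).
labels = np.array([0] * 357 + [1] * 212)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.33, random_state=42, stratify=labels
)

# Share of malignant samples overall, in the training part and in the test part.
print(round(labels.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```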

<a id = "15"></a><br>
## Standardization

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(x_train)
X_test  = sc.transform(x_test) 
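
The scaler is fitted on the training part only and then reused on the test part, so no test statistics leak into training. A small sketch on synthetic numbers (made up for illustration) shows that `transform` applies the training mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# Same result by hand: z = (x - mean_train) / std_train
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(test_scaled, manual))
```

The scaled training columns have mean 0 and standard deviation 1; the test columns are only approximately so, which is expected.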

<a id = "16"></a><br>
## KNN Model

* Sensitive to outliers
* Can be a problem on large datasets
* Suffers from the curse of dimensionality
* Requires feature scaling
* Can be a problem on imbalanced data
* Depends on K: the model checks the K nearest neighbours
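
The dependence on K can be seen in a from-scratch sketch of the voting rule. This is a simplified illustration with made-up points, not scikit-learn's implementation; `knn_predict` is a hypothetical helper.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy data: two class-0 points near the query, three class-1 points further away.
points = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.0), (3.2, 2.9), (2.9, 3.3)]
labels = [0, 0, 1, 1, 1]
query = (1.5, 1.5)

print(knn_predict(points, labels, query, k=1))
print(knn_predict(points, labels, query, k=5))
```

With k=1 the nearest point decides and the query is labelled 0; with k=5 all points vote and the larger class wins, which is also why imbalanced data can be a problem.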

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [None]:
knn_cm = confusion_matrix(y_test,y_pred)
knn_acc = metrics.accuracy_score(y_test, y_pred)
print(knn_cm)
print(knn_acc)
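
The accuracy printed above can be derived from the confusion matrix, along with recall and precision for the malignant class. The sketch below uses a short made-up label vector (`y_true` and `y_hat` are hypothetical) and also calls `classification_report`, which is imported at the top of the notebook.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels: 0 = benign, 1 = malignant.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_hat)  # rows = true class, columns = predicted class
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)       # share of malignant cases that were found
precision = tp / (tp + fp)    # share of malignant predictions that were right

print(cm)
print(accuracy, recall, precision)
print(classification_report(y_true, y_hat, target_names=["benign", "malignant"]))
```

For a diagnostic task, recall on the malignant class is often worth reporting next to accuracy, since a missed malignant case is usually the costlier error.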

<a id = "17"></a><br>
## KNN Tuning

In [None]:
# Tuning KNN Model
n_neighbors = [5,7,9,11,13,15,17,19,21]
weights = ["uniform","distance"]
metric = ["euclidean","manhattan","minkowski"]
param_grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)

In [None]:
knn = KNeighborsClassifier()
gs = GridSearchCV(estimator=knn,param_grid=param_grid,scoring="accuracy", cv=10)
grid_search = gs.fit(X_train,y_train)  # search on the standardized training data, as used for the final model
best_score = grid_search.best_score_
best_parameters = grid_search.best_params_
print("Best Score:",best_score)
print("Best Parameters:",best_parameters)
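
One way to keep scaling and tuning consistent is to put the scaler and the classifier in a `Pipeline`, so each cross-validation fold is scaled using only its own training part. The sketch below is an alternative layout, not the notebook's method. It assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) as a stand-in for the Kaggle CSV, and it uses a smaller grid to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption). Its labels are 0 = malignant, 1 = benign,
# the reverse of the encoding used in this notebook.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [5, 9, 13],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_score_, 3))        # mean cross-validated accuracy
print(round(search.score(X_te, y_te), 3))  # accuracy on the held-out test part
```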

<a id = "18"></a><br>
## Make Prediction After Tuning

In [None]:
knn = KNeighborsClassifier(**best_parameters)  # originally hard-coded as metric='manhattan', n_neighbors=9, weights='distance'
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [None]:
knn_cm = confusion_matrix(y_test,y_pred)
knn_acc = metrics.accuracy_score(y_test, y_pred)
print(knn_cm)
print(knn_acc)

<a id = "19"></a><br>
## Principal Component Analysis (PCA)

In [None]:
data = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")
# Drop Unnecessary columns
data.drop(["Unnamed: 32","id"],axis=1,inplace=True)
# Change object to integer:
data["diagnosis"] = [1 if item == "M" else 0  for item in data["diagnosis"]]
y = data["diagnosis"]
x = data.drop(["diagnosis"],axis=1)

In [None]:
# PCA needs scaled data
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

In [None]:
# Build PCA
pca = PCA(n_components = 2)
pca.fit(x_scaled)
X_reduced_pca = pca.transform(x_scaled)
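
Before relying on two components, it helps to check how much variance they retain. The sketch below assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) as a stand-in for the Kaggle CSV; `pca_check` and `ratios` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): same 569 x 30 feature matrix, loaded from scikit-learn.
X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca_check = PCA(n_components=2).fit(X_scaled)
ratios = pca_check.explained_variance_ratio_

# Fraction of total variance captured by each component, and by both together.
print(ratios, ratios.sum())
```

In the notebook itself, the same attribute is available as `pca.explained_variance_ratio_` once `pca.fit(x_scaled)` has run.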

In [None]:
pca_data = pd.DataFrame(X_reduced_pca,columns=["p1","p2"])
pca_data["diagnosis"] = y

<a id = "20"></a><br>
## Visualization of the New Dataframe

In [None]:
hue =pca_data["diagnosis"]
data = [go.Scatter(x = pca_data.p1,
                   y = pca_data.p2,
                   mode = 'markers',
                   marker=dict(
                           size=12,
                           color=hue,
                           symbol="pentagon",
                           line=dict(width=2) # marker outline
                           ))]  
                            
layout = go.Layout(title="PCA",
                   xaxis=dict(title="p1"),
                   yaxis=dict(title="p2"),
                   hovermode="closest")
fig = go.Figure(data=data,layout=layout)   
pyo.iplot(fig)                

<a id = "21"></a><br>
## Classification After PCA

In [None]:
pca_data.head()

### Prepare X and Y

In [None]:
y_pca = pca_data.diagnosis
x_pca = pca_data.drop(["diagnosis"],axis=1)

In [None]:
x_train_pca, x_test_pca, y_train_pca, y_test_pca = train_test_split(x_pca, y_pca, test_size=0.33, random_state=42)

### KNN Model via PCA Features

In [None]:
knn_pca = KNeighborsClassifier()
knn_pca.fit(x_train_pca, y_train_pca)
y_pred_pca = knn_pca.predict(x_test_pca)

In [None]:
knn_cm_pca = confusion_matrix(y_test_pca,y_pred_pca)
knn_acc_pca = metrics.accuracy_score(y_test_pca, y_pred_pca)
print(knn_cm_pca)
print(knn_acc_pca)

## Let's See Which Points Are in the Correct Area

In [None]:
# visualize 
cmap_light = ListedColormap(['orange',  'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'darkblue'])

h = .05 # step size in the mesh
X = x_pca
x_min, x_max = (X.iloc[:, 0].min() - 1), (X.iloc[:, 0].max() + 1)
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = knn_pca.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(20, 10), dpi=80)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Also plot the data points (train and test together)
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=cmap_bold,
            edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

<a id = "22"></a><br>
## Neighborhood Components Analysis (NCA)

In [None]:
nca = NeighborhoodComponentsAnalysis(n_components = 2, random_state = 42)
nca.fit(x_scaled, y)
x_nca = nca.transform(x_scaled)
nca_data = pd.DataFrame(x_nca, columns = ["p1","p2"])
nca_data["diagnosis"] = y
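
NCA is supervised, so fitting it on all rows (as above) lets the test labels influence the projection, and the accuracy measured afterwards may be optimistic. A leakage-free variant splits first and fits the scaler, NCA and KNN on the training part only. The sketch below is one possible layout, assuming scikit-learn's bundled copy of the dataset (`load_breast_cancer`) as a stand-in for the Kaggle CSV.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption); labels here are 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)

nca_knn = Pipeline([
    ("scale", StandardScaler()),
    ("nca", NeighborhoodComponentsAnalysis(n_components=2, random_state=42)),
    ("knn", KNeighborsClassifier()),
])
nca_knn.fit(X_tr, y_tr)  # scaler and NCA only ever see training rows and labels

holdout_acc = nca_knn.score(X_te, y_te)
print(round(holdout_acc, 3))
```

Comparing this held-out figure with the one obtained from the full-data fit gives a rough sense of how much the leakage matters.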

<a id = "23"></a><br>
## Visualization of the New Dataframe

In [None]:
hue =nca_data["diagnosis"]
data_nca = [go.Scatter(x = nca_data.p1,
                   y = nca_data.p2,
                   mode = 'markers',
                   marker=dict(
                           size=7,
                           color=hue,
                           symbol="circle",
                           line=dict(width=2) 
                           ))]  
                            
layout = go.Layout(title="NCA",
                   xaxis=dict(title="p1"),
                   yaxis=dict(title="p2"),
                   hovermode="closest")
fig = go.Figure(data=data_nca,layout=layout)   
pyo.iplot(fig) 

<a id = "24"></a><br>
## Classification After NCA

In [None]:
y_nca = nca_data.diagnosis
x_nca = nca_data.drop(["diagnosis"],axis=1)

In [None]:
x_train_nca, x_test_nca, y_train_nca, y_test_nca = train_test_split(x_nca, y_nca, test_size=0.33, random_state=42)

### KNN Model via NCA Features

In [None]:
knn_nca = KNeighborsClassifier()
knn_nca.fit(x_train_nca, y_train_nca)
y_pred_nca = knn_nca.predict(x_test_nca)

In [None]:
knn_cm_nca = confusion_matrix(y_test_nca,y_pred_nca)
knn_acc_nca = metrics.accuracy_score(y_test_nca, y_pred_nca)
print(knn_cm_nca)
print(knn_acc_nca)

## Let's See Which Points Are in the Correct Area

In [None]:
# visualize 
cmap_light = ListedColormap(['orange',  'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'darkblue'])

h = .3 # step size in the mesh
X = x_nca
x_min, x_max = (X.iloc[:, 0].min() - 1), (X.iloc[:, 0].max() + 1)
y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = knn_nca.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(figsize=(20, 10), dpi=80)
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Also plot the data points (train and test together)
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, cmap=cmap_bold,
            edgecolor='k', s=20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

<a id = "25"></a><br>
## Compare Accuracies

In [None]:
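models = ["Default","PCA","NCA"]
# Accuracies are hard-coded (rounded) here rather than read from the accuracy variables computed above
values = [0.946,0.952,0.984]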

In [None]:
# Compare Model's Acc
f,ax = plt.subplots(figsize = (10,7))
sns.barplot(x=models, y=values,palette="viridis");
plt.title("Compare Accuracies",fontsize = 20,color='blue')
plt.xlabel('Analysis',fontsize = 15,color='blue')
plt.ylabel('Accuracies',fontsize = 15,color='blue')