1. [Import Libraries and Data](#1)
    * 1.1 [Import Libraries](#1.1)
    * 1.2 [Load and Check Data](#1.2)
2. [Variable Description](#2)
    * 2.1 [Univariate Variable Analysis](#2.1)
    * 2.2 [Selected Numerical Variable](#2.2)
3. [Basic Data Analysis](#3)
4. [Outlier Detection](#4)
5. [Missing Value](#5)
    * 5.1 [Find Missing Value](#5.1)
6. [Visualization](#6)
    * 6.1 [Correlation](#6.1)
    * 6.2 [Radius Mean -- Diagnosis](#6.2)
    * 6.3 [Perimeter Mean -- Diagnosis](#6.3)
    * 6.4 [Area Mean -- Diagnosis](#6.4)
    * 6.5 [Concavity Mean -- Diagnosis](#6.5)
    * 6.6 [Concave Points Mean -- Diagnosis](#6.6)
7. [Modeling](#7)
    * 7.1 [Train Test Split](#7.1)
    * 7.2 [Scaling](#7.2)
    * 7.3 [Training](#7.3)
    * 7.4 [Hyperparameter Tuning -- Grid Search -- Cross Validation](#7.4)
    * 7.5 [Let's Try a Neural Network](#7.5)

<a id='1'></a>
# 1. Import Libraries, Load and Check Data

<a id='1.1'></a>
## 1.1 Import Libraries

In [None]:
import numpy as np
import pandas as pd 

import matplotlib.pyplot as plt
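# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in Matplotlib 3.6
style = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)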
import seaborn as sns
from collections import Counter


import warnings 
warnings.filterwarnings('ignore')

<a id='1.2'></a>
## 1.2 Load and Check Data

In [None]:
df = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

In [None]:
df.columns

In [None]:
df.describe().T

In [None]:
df.shape

<a id='2'></a>
# 2. Variable Description
  

  
Attribute Information:

1) ID number  
2) Diagnosis (M = malignant, B = benign)  
3-32) Ten real-valued features are computed for each cell nucleus:  
  
a) radius (mean of distances from center to points on the perimeter)  
b) texture (standard deviation of gray-scale values)  
c) perimeter  
d) area  
e) smoothness (local variation in radius lengths)  
f) compactness (perimeter^2 / area - 1.0)  
g) concavity (severity of concave portions of the contour)  
h) concave points (number of concave portions of the contour)  
i) symmetry  
j) fractal dimension ("coastline approximation" - 1)  
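
The column names listed in section 2.1 suggest that each of these ten measurements is stored three times, with the suffixes `_mean`, `_se` and `_worst`, which would account for the 30 feature columns (attributes 3-32). A small sketch of how those names line up, assuming that naming pattern holds for every column:

```python
# Ten base measurements listed above (names follow the CSV header convention).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Three summary statistics per measurement, as seen in the column suffixes.
suffixes = ["mean", "se", "worst"]

feature_columns = [f"{name}_{suffix}" for suffix in suffixes for name in base_features]

print(len(feature_columns))   # 30
print(feature_columns[:3])
```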

In [None]:
df.info()

<a id='2.1'></a>
## 2.1 Univariate Variable Analysis
**Categorical Variable** : diagnosis  
**Numerical Variable** : radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean,
concave points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se,
smoothness_se, compactness_se, concavity_se, concave points_se, symmetry_se, fractal_dimension_se, radius_worst,
texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst,
concave points_worst, symmetry_worst, fractal_dimension_worst, id  

In [None]:
# Encode the categorical 'diagnosis' column as integers (B -> 0, M -> 1)
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
df['diagnosis'] = labelencoder_Y.fit_transform(df['diagnosis'])
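
To see which integer each label receives, here is a minimal sketch on toy labels (not rows from the dataset). `LabelEncoder` sorts the classes, so 'B' should map to 0 and 'M' to 1:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["M", "B", "B", "M"])  # toy labels, not taken from the dataset
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is 'B' and index 1 is 'M'
mapping = {str(cls): int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```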

In [None]:
# Create a pair plot
sns.pairplot(df.iloc[:, 1:6], hue = 'diagnosis');

<a id='2.2'></a>
## 2.2 Selected Numerical Variable

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(df[variable],bins = 10)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with histogram".format(variable))
    plt.show()

In [None]:
selected_numericalVar = ['diagnosis','radius_mean', 'perimeter_mean', 'area_mean','concavity_mean', "concave points_mean"]
for n in selected_numericalVar:
    plot_hist(n)

<a id='3'></a>
# 3. Basic Data Analysis
* Radius Mean - Diagnosis
* Perimeter Mean - Diagnosis
* Area Mean - Diagnosis
* Concavity Mean - Diagnosis
* Concave Points Mean - Diagnosis

In [None]:
# Radius Mean - Diagnosis
df[['radius_mean', 'diagnosis']].groupby(['radius_mean'], as_index = False).mean().sort_values(by = 'diagnosis', ascending = False)

In [None]:
# Perimeter  Mean - Diagnosis
df[['perimeter_mean', 'diagnosis']].groupby(['perimeter_mean'], as_index = False).mean().sort_values(by = 'diagnosis', ascending = False)

In [None]:
# Area Mean - Diagnosis
df[['area_mean', 'diagnosis']].groupby(['area_mean'], as_index = False).mean().sort_values(by = 'diagnosis', ascending = False)

In [None]:
# Concavity Mean - Diagnosis
df[['concavity_mean', 'diagnosis']].groupby(['concavity_mean'], as_index = False).mean().sort_values(by = 'diagnosis', ascending = False)

In [None]:
# Concave Points Mean - Diagnosis
df[["concave points_mean", 'diagnosis']].groupby(["concave points_mean"], as_index = False).mean().sort_values(by = 'diagnosis', ascending = False)
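
Because the feature values are continuous, the groupings above give close to one row per distinct value. A complementary view, sketched here on made-up numbers rather than the real `df`, is to group by diagnosis and compare the average of each feature per class:

```python
import pandas as pd

# Toy frame with made-up values; in the notebook this would be `df`.
toy = pd.DataFrame({
    "diagnosis":   [0, 0, 0, 1, 1],
    "radius_mean": [11.0, 12.0, 13.0, 18.0, 20.0],
    "area_mean":   [400.0, 450.0, 500.0, 1000.0, 1200.0],
})

# One row per class, one column per feature
summary = toy.groupby("diagnosis")[["radius_mean", "area_mean"]].mean()
print(summary)
```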

<a id='4'></a>
# 4. Outlier Detection

In [None]:
def detect_outlier(df, features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3-Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # Detect Outlier and Their Indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # Store Indices
        outlier_indices.extend(outlier_list_col)
        
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    
    return multiple_outliers
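
The function applies the 1.5 x IQR rule per feature and then keeps only the rows flagged in more than two features. The per-feature step can be sketched on a toy array (values chosen for illustration only):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 100])  # toy data with one obvious outlier

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
step = 1.5 * iqr
lower, upper = q1 - step, q3 + step

# Anything outside [lower, upper] is treated as an outlier for this feature
outliers = values[(values < lower) | (values > upper)]
print(q1, q3, lower, upper)
print(outliers)
```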

In [None]:
df.loc[detect_outlier(df, ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
                           'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
                           'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
                           'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
                           'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
                           'fractal_dimension_se', 'radius_worst', 'texture_worst',
                           'perimeter_worst', 'area_worst', 'smoothness_worst',
                           'compactness_worst', 'concavity_worst', 'concave points_worst',
                           'symmetry_worst', 'fractal_dimension_worst'])]

* 83 rows are flagged as outliers in more than two features, so let's drop them

In [None]:
# drop outliers 
df= df.drop(detect_outlier(df, ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
                           'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
                           'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
                           'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
                           'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
                           'fractal_dimension_se', 'radius_worst', 'texture_worst',
                           'perimeter_worst', 'area_worst', 'smoothness_worst',
                           'compactness_worst', 'concavity_worst', 'concave points_worst',
                           'symmetry_worst', 'fractal_dimension_worst']), axis = 0).reset_index(drop = True)

<a id='5'></a>
# 5. Missing Value

<a id='5.1'></a>
## 5.1 Find Missing Value

In [None]:
df.columns[df.isnull().any()]

In [None]:
df['Unnamed: 32']

* This column is unnecessary, so I delete it directly

In [None]:
del df['Unnamed: 32']
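
A slightly more general variant is sketched below. It assumes the unwanted column is empty in every row (the toy frame mimics that) and drops any column that is entirely NaN without naming it:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "radius_mean": [11.0, 12.0, 13.0],
    "Unnamed: 32": [np.nan, np.nan, np.nan],  # mimics an empty trailing column
})

# Count missing values per column and show only the columns that have any
missing_counts = toy.isnull().sum()
print(missing_counts[missing_counts > 0])

cleaned = toy.dropna(axis=1, how="all")  # drop only columns that are entirely NaN
print(cleaned.columns.tolist())
```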

<a id='6'></a>
# 6. Visualization

<a id='6.1'></a>
## 6.1 Correlation

In [None]:
# Let's create a colorful correlation matrix 
sns.heatmap(df.iloc[:,1:12].corr(), annot = True);
# as you can see, it is much better than the table
# but if you want you can use
# df.iloc[:,1:12].corr()
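
When the heatmap gets crowded, it can help to rank the features by the strength of their correlation with the target. A minimal sketch on a toy frame with made-up values (in the notebook, `df` with the encoded `diagnosis` column would take its place):

```python
import pandas as pd

toy = pd.DataFrame({
    "diagnosis":       [0, 0, 0, 1, 1, 1],
    "radius_mean":     [11.0, 12.0, 13.0, 17.0, 18.0, 19.0],
    "smoothness_mean": [0.10, 0.09, 0.11, 0.10, 0.11, 0.09],
})

# Absolute Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["diagnosis"]
    .drop("diagnosis")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```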

<a id='6.2'></a>
## 6.2 Radius Mean -- Diagnosis

In [None]:
g = sns.FacetGrid(df, col = 'diagnosis')
g.map(sns.distplot, 'radius_mean', bins = 2)
plt.show()


<a id='6.3'></a>
## 6.3 Perimeter Mean -- Diagnosis

In [None]:
g = sns.FacetGrid(df, col = 'diagnosis')
g.map(sns.distplot, 'perimeter_mean', bins = 5)
plt.show()


<a id='6.4'></a>
## 6.4 Area Mean -- Diagnosis

In [None]:
g = sns.FacetGrid(df, col = 'diagnosis')
g.map(sns.distplot, 'area_mean', bins = 50)
plt.show()


<a id='6.5'></a>
## 6.5 Concavity Mean -- Diagnosis

In [None]:
g = sns.FacetGrid(df, col = 'diagnosis')
g.map(sns.distplot, 'concavity_mean')
plt.show()


<a id='6.6'></a>
## 6.6 Concave Points Mean -- Diagnosis

In [None]:
g = sns.FacetGrid(df, col = 'diagnosis')
g.map(sns.distplot, "concave points_mean")
plt.show()


<a id='7'></a>

# 7. Modeling

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

<a id='7.1'></a>

## 7.1 Train Test Split

In [None]:
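# Split the data set into independent (X) and dependent (y) data sets
X = df.iloc[:, 2:32].values  # all 30 feature columns (radius_mean ... fractal_dimension_worst)
y = df.iloc[:, 1].values.astype(int)  # encoded diagnosis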

In [None]:
# Split the data set into 67% training and 33% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33, random_state = 123)
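
The split above is purely random. If the class balance should be the same in both parts, `train_test_split` also accepts a `stratify` argument. A sketch on toy data (the 30% positive rate is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 30% positive class (made-up, for illustration only).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=123, stratify=y_toy
)

# The share of positives is roughly preserved in both parts
print(y_tr.mean(), y_te.mean())
```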


<a id='7.2'></a>

## 7.2 Scaling

In [None]:
# Scale the data (feature scaling)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
# Reuse the training statistics; refitting on the test set would scale it inconsistently
X_test_sc  = sc.transform(X_test)

#X_train
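
Standardization subtracts the training mean and divides by the training standard deviation, and the same two numbers are reused for the test set. A small sketch on toy values that checks the scaler against the formula written out by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0]])  # toy values
X_test_toy = np.array([[2.0], [4.0]])

scaler = StandardScaler().fit(X_train_toy)      # statistics from training data only
X_test_scaled = scaler.transform(X_test_toy)    # reuse them on the test data

# Same result by hand: z = (x - mean_train) / std_train
mean_train = X_train_toy.mean(axis=0)
std_train = X_train_toy.std(axis=0)
manual = (X_test_toy - mean_train) / std_train

print(X_test_scaled.ravel(), manual.ravel())
```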

<a id='7.3'></a>

## 7.3 Training

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train_sc, y_train)  # train on the scaled features from section 7.2
acc_log_train = round(logreg.score(X_train_sc, y_train)*100,2)
acc_log_test = round(logreg.score(X_test_sc,y_test)*100,2)
print("Training Accuracy: {}%".format(acc_log_train))
print("Testing Accuracy: {}%".format(acc_log_test))

<a id='7.4'></a>

## 7.4 Hyperparameter Tuning -- Grid Search -- Cross Validation
Compare five ML classifiers and evaluate the mean accuracy of each with stratified cross-validation.

* Decision Tree
* SVM
* Random Forest
* KNN
* Logistic Regression

In [None]:
random_state = 42
classifier = [DecisionTreeClassifier(random_state = random_state),
              SVC(random_state = random_state),
              RandomForestClassifier(random_state = random_state),
              LogisticRegression(random_state = random_state),
              KNeighborsClassifier()]

In [None]:
dt_param_grid = {'min_samples_split': range(10,500,20),
                'max_depth': range(1,20,2)}
svc_param_grid = {'kernel': ['rbf'],
                  'gamma' : [0.001, 0.01, 0.1, 1],
                  'C'     : [1,10,50,100,200,300,1000]}
rf_param_grid = {"max_features": [1,3,10],
                 "min_samples_split":[2,3,10],
                 "min_samples_leaf":[1,3,10],
                 "bootstrap":[False],
                 "n_estimators":[100,300],
                 "criterion":["gini"]}
logreg_param_grid = {'C'      : np.logspace(-3,3,7),
                     'penalty': ['l1', 'l2'],
                     'solver' : ['liblinear']}  # liblinear supports both l1 and l2; the default lbfgs rejects l1
knn_param_grid = {"n_neighbors": np.linspace(1,19,10, dtype = int).tolist(),
                  "weights"    : ["uniform","distance"],
                  "metric"     :["euclidean","manhattan"]}
classifier_param = [dt_param_grid,
                    svc_param_grid,
                    rf_param_grid,
                    logreg_param_grid,
                    knn_param_grid]

In [None]:
cv_result = []
best_estimators = []
for i in range(len(classifier)):
    clf = GridSearchCV(classifier[i], param_grid=classifier_param[i], cv = StratifiedKFold(n_splits = 10), scoring = "accuracy", n_jobs = -1,verbose = 1)
    clf.fit(X_train_sc,y_train)  # scaled features; SVM and KNN are sensitive to feature scale
    cv_result.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print(cv_result[i])
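
One way to keep scaling inside the cross-validation loop is to wrap the scaler and the classifier in a pipeline, so that each fold fits the scaler on its own training part only. The sketch below is an illustration, not the search used above: it assumes scikit-learn's bundled copy of the dataset as a stand-in for `data.csv`, uses SVC only, and shrinks the grid and fold count for speed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled copy of the dataset stands in for data.csv.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# make_pipeline names each step after its class in lowercase, hence the "svc__" prefix
pipe = make_pipeline(StandardScaler(), SVC(random_state=42))
param_grid = {"svc__gamma": [0.001, 0.01], "svc__C": [1, 10]}  # reduced grid for speed

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=5), scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```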

In [None]:
cv_results = pd.DataFrame({"Cross Validation Means":cv_result,
                           "ML Models":["DecisionTreeClassifier", "SVM","RandomForestClassifier","LogisticRegression","KNeighborsClassifier"]})

g = sns.barplot(x = "Cross Validation Means", y = "ML Models", data = cv_results)
g.set_xlabel("Mean Accuracy")
g.set_title("Cross Validation Scores")

<a id='7.5'></a>

## 7.5 Let's Try a Neural Network

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import MinMaxScaler

In [None]:
df.head() # diagnosis part already encoded


In [None]:
X = df.iloc[:, 2:32].values  # all 30 feature columns
y = df.iloc[:, 1].values.astype(int)  # encoded diagnosis

In [None]:
# Split our data 80% training / 10% validation / 10% testing
X_train, X_val_and_test, y_train, y_val_and_test = train_test_split(X, y, test_size = 0.2, random_state = 123)
# then split the held-out 20% evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_val_and_test, y_val_and_test, test_size = 0.5, random_state = 123)

# Fit the scaler on the training set only, then apply it to the validation and test sets
min_max_scaler = MinMaxScaler()
X_train = min_max_scaler.fit_transform(X_train)
X_val = min_max_scaler.transform(X_val)
X_test = min_max_scaler.transform(X_test)

print("X_train shape :" ,X_train.shape, "X_val shape :", X_val.shape, "X_test shape :",X_test.shape)
print("y_train shape :" ,y_train.shape, "y_val shape :", y_val.shape, "y_test shape :",y_test.shape)

In [None]:
# Build the model and architecture of the deep neural network
model = Sequential() # initializes the NN
model.add(Dense(units = 32, activation= 'relu', input_dim = X_train.shape[1]))  # one input per feature
model.add(Dense(units = 32, activation= 'relu'))
model.add(Dense(units = 32, activation= 'relu'))
model.add(Dense(units = 1, activation= 'sigmoid'))
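
To get a feel for the size of this network, the number of trainable parameters in a stack of dense layers can be counted by hand: each layer has inputs x units weights plus one bias per unit. The helper below is a hypothetical illustration, not part of Keras; `n_features` is an assumption and should match the number of feature columns actually fed to the model.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += n_inputs * units + units  # weight matrix plus one bias per unit
        n_inputs = units                   # this layer's output feeds the next layer
    return total

n_features = 30  # assumption: adjust to the number of feature columns actually used
print(dense_param_count(n_features, [32, 32, 32, 1]))  # 3137
```

In the notebook, `model.summary()` should report the same total for the same input size.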

In [None]:
# Loss function measures how well the model did on training and then tries to improve on it using optimizer
model.compile(optimizer='sgd',
              loss = 'binary_crossentropy',
              metrics = ['accuracy']
              )
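
For intuition, binary cross-entropy averages -[y log(p) + (1 - y) log(1 - p)] over the samples. The plain-Python sketch below follows that formula; the function name and the `eps` clipping value are choices made here, and the Keras implementation may differ in such details.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))  # confident and correct: low loss
print(binary_crossentropy([1, 0], [0.1, 0.9]))  # confident and wrong: high loss
```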

In [None]:
# Train the model
hist = model.fit(
    X_train, y_train,
    batch_size = 32,
    epochs = 100,
    validation_data = (X_val, y_val)
)

In [None]:
model.evaluate(X_test, y_test)[1] # index 1 is the accuracy; index 0 is the loss
# In the original run this reported an accuracy of 1.0 on the held-out 10% test set

In [None]:
# Make a prediction: threshold the sigmoid output at 0.5
prediction = model.predict(X_test)
prediction = [1 if p >= 0.5 else 0 for p in prediction.ravel()]
prediction

In [None]:
# Visualize the training loss and validation loss to see if the model is overfitting
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc = 'upper right');
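
Besides reading the plot, the epoch with the lowest validation loss can be picked out directly. The numbers below are hypothetical, not results from this notebook; in the notebook, `hist.history['val_loss']` would take the place of the list. A validation loss that keeps rising after that epoch while the training loss still falls would point to overfitting.

```python
# Hypothetical validation losses, one per epoch (not results from this notebook).
val_loss = [0.62, 0.41, 0.30, 0.27, 0.29, 0.33]

# Index of the smallest validation loss
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
print(best_epoch + 1, val_loss[best_epoch])  # epochs are 1-based in the Keras log
```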

In [None]:
# The model does not appear to be overfitting
hist.history['val_accuracy']

In [None]:
# Visualize the training accuracy and validation accuracy to see if the model is overfitting
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Val'], loc = 'lower right');

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))
print(accuracy_score(y_test, prediction))
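
The precision and recall in the report come from the confusion matrix. A sketch on toy labels (made up here, with 1 standing for malignant) shows how they are derived:

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 0, 1, 1, 1, 1]  # toy labels, 1 = malignant
y_pred_toy = [0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
precision = tp / (tp + fp)  # share of predicted malignant cases that are malignant
recall = tp / (tp + fn)     # share of malignant cases that were found
print(tn, fp, fn, tp)
print(precision, recall)
```

For a diagnosis task, recall on the malignant class is usually the number to watch, since a false negative is a missed tumor.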