# Introduction

Hello people, welcome to my kernel! In this kernel I am going to apply deep learning to the breast cancer dataset. I am going to explain everything step by step. Before we start, let's take a look at our schedule.

# Schedule
1. Importing Libraries and Dataset
1. Dataset Overview
1. Simple Data Analyses
    * Diagnosis Countplot
    * Radius Mean Histogram
    * Texture Mean Histogram
    * Smoothness Mean Histogram
1. Outlier Detection
1. Detailed Data Analyses
    * Correlation Heatmap
    * Correlation Between Features
        * Radius Mean - Texture Mean Scatter Plot
        * Radius Mean - Smoothness Mean Scatter Plot
1. Preprocessing
    * Dropping Irrelevant Features
    * Converting Label Feature Into Int64
    * Scaling (Normalizing)
    * Train Test Split
1. Modeling
    * Creating Model Function
    * Cross Validation
    * Fitting Model
    * Prediction and Result
1. Conclusion
    

# Importing Libraries and Dataset
In this section I am going to import the libraries and the dataset that I will use. However, I am not going to import the deep learning libraries and scikit-learn here; I will import them when I need them.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Data manipulation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

And now I am going to import the dataset. It is in CSV format, so I will use pandas' read_csv function to load it.

In [None]:
data = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

# Dataset Overview

In this section I am going to examine the dataset.

In [None]:
data.head()

* There are two irrelevant columns in the dataset: id and Unnamed: 32. We have to drop them.
* Without them, 31 columns remain: the diagnosis label and 30 numeric features.


In [None]:
data.info()

* Good news! Apart from the empty Unnamed: 32 column, which we are going to drop anyway, our dataset does not have any NaN values, so we do not need to fill anything.
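If you prefer numbers over reading the `data.info()` output, here is a minimal sketch that counts missing values per column. The `toy` frame and its values are made up for illustration; on the real dataset you would call the same methods on `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "radius_mean": [14.2, 20.1, 11.8],
    "texture_mean": [19.5, np.nan, 17.3],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns that contain nothing but missing values.
empty_columns = [col for col in toy.columns if toy[col].isna().all()]
print(empty_columns)
```

A column that shows up in `empty_columns` carries no information, so dropping it loses nothing.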

# Simple Data Analyses

In this section I am going to do some simple EDA. I cannot examine every feature because there are 30 of them, so I am going to look at a few.

## Diagnosis Countplot

In [None]:
fig,ax = plt.subplots(figsize=(8,6))
sns.countplot(x=data.diagnosis)  # pass the column as x; newer seaborn versions reject it as a positional argument
plt.show()

* As we can see, there are two labels in our dataset: M (malignant) and B (benign).
* Most of the samples are labeled B.
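The same information can be read as numbers instead of bars. This is only a sketch on a made-up label column; `labels` is a hypothetical stand-in for `data.diagnosis`.

```python
import pandas as pd

# Made-up labels standing in for data.diagnosis.
labels = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# Absolute number of samples per label.
counts = labels.value_counts()
print(counts)

# Share of each label (the values sum to 1).
shares = labels.value_counts(normalize=True)
print(shares)
```

Knowing the share of each label is useful later, because an accuracy score should always be compared with the share of the majority class.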


## Radius Mean Histogram

In [None]:
fig,ax = plt.subplots(figsize=(8,6))
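# distplot is deprecated in newer seaborn versions; histplot with kde=True draws a similar plot
sns.histplot(data["radius_mean"],kde=True,stat="density",color="#FC2D2D")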
plt.show()

* As we can see, most radius_mean values are between 10 and 20.


## Texture Mean Histogram

In [None]:
fig,ax = plt.subplots(figsize=(8,6))
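# distplot is deprecated in newer seaborn versions; histplot with kde=True draws a similar plot
sns.histplot(data["texture_mean"],kde=True,stat="density",color="#F739F7")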
plt.show()

* The shape of the texture_mean histogram is very similar to the radius_mean histogram.

## Smoothness Mean Histogram

In [None]:
fig,ax = plt.subplots(figsize=(8,6))
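# distplot is deprecated in newer seaborn versions; histplot with kde=True draws a similar plot
sns.histplot(data["smoothness_mean"],kde=True,stat="density",color="#F5BA5D")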
plt.show()

* As we can see, our smoothness_mean values are between 0 and 1.
* radius_mean values are mostly between 10 and 20, so the features are on very different scales. If we do not normalize our dataset, this will cause problems for the model.
* Most of the values are between 0.06 and 0.14.

I think this is enough to get a better idea of the dataset. Let's move on to the next section.

# Outlier Detection

In this section I am going to drop outlier values. I once heard a data scientist say that outliers are silent killers, so you have to drop them from the dataset. Yes, he is definitely right.

But I am not going to drop rows that have only one outlier value. I am going to drop the rows that have more than four outlier values. I have to do this because I do not want to drop too many rows.

Let's go!

In [None]:
from collections import Counter

def outlier_index_detector(df,features):
    counts = Counter() # How many features flag each row as an outlier
    for ftr in features:
        
        Q1 = df[ftr].quantile(0.25) # Lower quartile
        Q3 = df[ftr].quantile(0.75) # Upper quartile
        IQR = Q3 - Q1 # IQR
        STEP = IQR*1.5 # Outlier Step
        
        # Use df (not the global data) so the function works on any dataframe
        ind = df[(df[ftr]<Q1-STEP) | (df[ftr]>Q3+STEP)].index
        counts.update(ind)
    
    # Keep the rows that are outliers in more than four features
    result = [index for index,count in counts.items() if count > 4]
    
    return result
    

### What Did I Do In This Function
1. I started a for loop in which each iteration handles a different feature.
1. I computed the outlier step using the outlier formula (1.5 times the IQR).
1. I found the outlier indexes for each feature.
    * For Instance:
        * Feature 1 Outlier Index : 32,12,42
        * Feature 2 Outlier Index : 32,16,89
    
    So we can say that row 32 has two outlier values.
1. I collected the indexes from all features.
1. I created a filter that counts how many times each index appears.
1. I filtered the rows that have more than four outlier values.
1. I appended them to my result list.
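To see the outlier formula from the steps above at work, here is a small sketch on one made-up array. The numbers are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up measurements with one obviously extreme value.
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])  # lower and upper quartile
iqr = q3 - q1                             # interquartile range
step = 1.5 * iqr                          # outlier step

lower_bound = q1 - step
upper_bound = q3 + step

# Values outside [lower_bound, upper_bound] are treated as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(lower_bound, upper_bound, outliers)
```

Here the quartiles are 12 and 16, so the allowed range is 6 to 22 and only the value 100 is flagged.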

* And now I am going to drop them to improve my future model.

In [None]:
feature_names = (list(data))
feature_names.remove("id")
feature_names.remove("diagnosis")
feature_names.remove("Unnamed: 32")
print(feature_names)

In [None]:
outliers = outlier_index_detector(data,feature_names)
print(outliers)

To be safe, I am going to remove any repeated indexes from the list, so that every row appears only once.


In [None]:
outliers = list(np.unique(outliers))
print("There are {} outlier rows \n".format(len(outliers)))
print(outliers)

And we are ready to drop them


In [None]:
print("Len of the dataset before dropping outliers",len(data))
data.drop(outliers,inplace=True)
print("Len of the dataset after dropping outliers",len(data))

We completed this section. Let's move on!

# Detailed Data Analyses

In this section I am going to examine correlations between features. But I cannot examine every pair, because there are 30 features in the dataset.

## Correlation Heatmap

First I am going to use a correlation heatmap to inspect all the relations between the features.

In [None]:
fig,ax = plt.subplots(figsize=(20,20))
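# Use only the numeric feature columns; newer pandas versions raise an error on the text column diagnosis
sns.heatmap(data[feature_names].corr(),annot=True,linewidths=1.5,fmt="0.1f")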
plt.show()

* What a confusing heatmap.
* There are many strong positive correlations between the features.
* Unlike the positive correlations, there are no strong negative correlations between the features.
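When a heatmap is this crowded, it can help to list the feature pairs sorted by correlation. This is a minimal sketch on a made-up frame; the column names and values in `toy` are hypothetical, and on the real dataset you would start from the correlation matrix of the feature columns instead.

```python
import numpy as np
import pandas as pd

# Made-up data: perimeter is roughly twice the radius, smoothness is unrelated noise.
toy = pd.DataFrame({
    "radius": [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter": [2.1, 4.0, 6.2, 7.9, 10.1],
    "smoothness": [0.3, 0.1, 0.4, 0.1, 0.5],
})

corr = toy.corr()

# Keep only the part above the diagonal so each pair appears once
# and the trivial self-correlations are removed.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs)
```

The first rows of `pairs` are the most strongly correlated features, which is much easier to read than a 30 by 30 grid.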

## Correlation Between Features

In this sub-section I am going to use features that I've used in simple data analyses. Let's begin.


### Radius Mean - Texture Mean Scatter Plot
We said that the histograms of these two features are similar, but that does not mean they are strongly correlated. Let's take a look at our scatter plot.

In [None]:
fig,ax = plt.subplots(figsize=(10,8))
sns.scatterplot(x="radius_mean",y="texture_mean",data=data,color="#670F91")
plt.show()

* As we can see, there is only a weak correlation between them.


### Radius Mean - Smoothness Mean Scatter Plot

In [None]:
fig,ax = plt.subplots(figsize=(10,8))
sns.scatterplot(x="radius_mean",y="smoothness_mean",data=data,color="#BD6F4B")
plt.show()

* There is no correlation between these features. 

# Preprocessing

In this section I am going to prepare the dataset for modeling. I am going to follow these steps:
* Dropping Irrelevant Features
* Converting Label Feature Into Int64
* Scaling (Normalizing)
* Train Test Split

Let's start with dropping irrelevant features.


## Dropping Irrelevant Features
I know this is easy, but as I said at the beginning of the kernel, I am going to explain everything step by step as much as I can.

In [None]:
data.drop(["Unnamed: 32","id"],axis=1,inplace=True)

* Axis 1 is columns (features) and axis 0 is rows (entries)
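Here is a tiny sketch that shows the difference between the two axes. The `toy` frame is made up for illustration; only the `drop` calls matter.

```python
import pandas as pd

# Made-up frame with the same two irrelevant columns as the real dataset.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "radius_mean": [14.2, 20.1, 11.8],
    "Unnamed: 32": [None, None, None],
})

# axis=1 drops columns by name.
without_columns = toy.drop(["id", "Unnamed: 32"], axis=1)
print(without_columns)

# axis=0 drops rows by index label.
without_first_row = toy.drop([0], axis=0)
print(without_first_row)
```

Without `inplace=True`, `drop` returns a new dataframe and leaves the original untouched.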

## Converting Label Feature Into Int64
In this section I am going to convert the label (diagnosis) into int64. In order to do this I am going to use a list comprehension from vanilla Python.

     We love python <3

In [None]:
print("First 5 entries",data.diagnosis[:5])
data.diagnosis = [0 if each == "M" else 1 for each in data.diagnosis]
print(data.diagnosis[:5])

* Yes, our first five diagnosis values are 0 (M), but do not worry, there are 1 values (B) in the dataset too.

In [None]:
data.tail()
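The list comprehension is not the only way to do this conversion. As an alternative sketch, pandas' `Series.map` with a dictionary gives the same encoding (M becomes 0, B becomes 1). The `diagnosis` series below is made up for illustration.

```python
import pandas as pd

# Made-up labels standing in for the diagnosis column.
diagnosis = pd.Series(["M", "B", "B", "M"])

# Same encoding as in this kernel: M -> 0, B -> 1.
encoded = diagnosis.map({"M": 0, "B": 1})
print(encoded.tolist())
```

One difference worth knowing: with `map`, a label that is missing from the dictionary becomes NaN, while the list comprehension would silently turn it into 1.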

## Scaling
We are approaching the most exciting section. In this section I am going to normalize the dataset. In order to do this I am going to apply this formula to each column separately:

     (value - min(column)) / (max(column) - min(column))

In [None]:
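# data.min() and data.max() work column by column; np.min(data) can return a single value for the whole frame in newer pandas versions
data = (data-data.min()) / (data.max()-data.min())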
data.head()
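To check the formula by hand, here is a small sketch on a made-up frame with two columns on very different scales. The values in `toy` are invented for illustration.

```python
import pandas as pd

# Two made-up columns on very different scales.
toy = pd.DataFrame({
    "radius_mean": [10.0, 15.0, 20.0],
    "smoothness_mean": [0.06, 0.10, 0.14],
})

# Min-max scaling, applied to each column separately.
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

After scaling, the smallest value of every column is 0, the largest is 1, and the middle values of both columns land at about 0.5 even though the raw numbers were far apart.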

## Train Test Split

In this section I am going to split the dataset into train and test sets. In order to do this I am going to use the train_test_split function from the sklearn library.

In [None]:
from sklearn.model_selection import train_test_split
x = data.drop("diagnosis",axis=1) 
y = data.diagnosis

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

* Our arrays are ready. Let's check the lengths of our arrays.


In [None]:
print("Len of the x_train",len(x_train))
print("Len of the x_test ",len(x_test))
print("Len of the y_train",len(y_train))
print("Len of the y_test ",len(y_test))
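Because the two labels are not equally common, one optional variation is to pass `stratify` to `train_test_split`, so the train and test sets keep the same label ratio. This is not what the kernel does above; it is only a sketch on made-up arrays (`X` and `y` here are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 30 labeled 0 and 70 labeled 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 30 + [1] * 70)

# stratify=y keeps the 30/70 label ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print("Share of label 1 in train:", y_train.mean())
print("Share of label 1 in test: ", y_test.mean())
```

With a small dataset, this makes the test score less dependent on which rows happen to fall into the test set.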

# Modeling

Finally, we are here! In this section I am going to create our deep learning model. I am going to follow these steps:
* Creating Model Function
* Cross Validation
* Fitting Model
* Prediction and Result

## Creating Model Function
In this section I am going to define the model function. I am going to use the keras library, and its scikit-learn wrapper needs a function that builds and returns the model.

In [None]:
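# Note: keras.wrappers.scikit_learn was removed from recent Keras releases; the separate scikeras package provides a replacement KerasClassifier
from keras.wrappers.scikit_learn import KerasClassifier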
from keras.models import Sequential
from keras.layers import Dense

In [None]:
def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units=12,kernel_initializer="uniform",activation="tanh",input_dim=30))
    classifier.add(Dense(units=6,kernel_initializer="uniform",activation="tanh"))
    classifier.add(Dense(units=1,kernel_initializer="uniform",activation="sigmoid")) # Output Layer
    classifier.compile(optimizer="adam",loss="binary_crossentropy",metrics=["accuracy"])
    return classifier
                   

### What Did I Do In This Function
1. I created a Sequential object (you can think of it as an empty model).
1. I added my layers.
    * units = how many nodes the layer has
    * kernel_initializer = how the layer's weights are initialized
    * activation = the activation function, like tanh, relu and sigmoid
1. I compiled my model.
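To get a feeling for how small this network is, we can count its trainable parameters by hand. A Dense layer has one weight for every input-output pair plus one bias per unit. The layer sizes 30, 12, 6 and 1 come from `build_classifier`; the variable names below are only for this sketch.

```python
# Sizes of the input and of each Dense layer in build_classifier.
layer_sizes = [30, 12, 6, 1]

# weights (inputs * units) + biases (units) for every Dense layer.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [372, 78, 7]
print(total_params)      # 457
```

Calling `summary()` on the Keras model should report the same numbers.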

### Why did I use Sigmoid Function In Output Layer?
Because in this kernel we are doing binary classification (0 and 1), and the sigmoid activation function outputs a value between 0 and 1 that we can read as the probability of class 1.
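Here is a minimal sketch of the sigmoid function in plain Python, together with the usual way of turning its output into a label. The 0.5 threshold and the helper name `to_label` are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """Turn a probability into a class label (threshold is an assumption)."""
    return 1 if probability >= threshold else 0

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(z, round(p, 3), to_label(p))
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and an input of 0 gives exactly 0.5.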

I know there are questions in your mind. I would like to explain everything in more detail, but I am a beginner in deep learning and my English is not very good. If you want to learn the details, you can check my teacher's kernel:

*https://www.kaggle.com/kanncaa1/deep-learning-tutorial-for-beginners*

## Cross Validation

In this section I am going to do cross validation and check the score. I am going to use the cross_val_score function from the sklearn library. But before this I am going to create my keras classifier.

In [None]:
classifier = KerasClassifier(build_fn=build_classifier,epochs=100)
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=classifier,X=x_train,y=y_train,cv=3)

print("Mean of CV scores",accuracies.mean())
print("Standard deviation of CV scores",accuracies.std())


* Our cross validation score mean is 97.6%
* Our cross validation standard deviation is 0.01
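The `std()` method returns the standard deviation, which is the square root of the variance, so the two numbers should not be mixed up. Here is a small sketch with made-up fold scores; `scores` is hypothetical and not the output of this kernel.

```python
import numpy as np

# Made-up accuracies from three cross validation folds.
scores = np.array([0.96, 0.98, 0.99])

mean_score = scores.mean()
std_score = scores.std()   # standard deviation
var_score = scores.var()   # variance = standard deviation squared

print("Mean:", mean_score)
print("Standard deviation:", std_score)
print("Variance:", var_score)
```

The standard deviation is in the same unit as the accuracy itself, which makes it easier to interpret next to the mean.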

## Fitting Model
In this section I am going to fit my model using my x_train and y_train arrays.

In [None]:
classifier.fit(x_train,y_train)

## Prediction and Result
In this section I am going to predict my test values and compute accuracy.

In [None]:
print("Our train score is",classifier.score(x_train,y_train))
print("Our test score is ",classifier.score(x_test,y_test))
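Accuracy alone does not tell us which kind of mistakes the model makes. As a sketch, the counts below separate malignant cases that were caught from malignant cases that were missed. The arrays `y_true` and `y_pred` are made up for illustration and use the encoding from this kernel (0 = M, 1 = B); they are not results of the model above.

```python
import numpy as np

# Made-up true labels and predictions (0 = malignant, 1 = benign).
y_true = np.array([0, 0, 1, 1, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 1, 1, 0, 1])

accuracy = float(np.mean(y_true == y_pred))

# Malignant samples that the model found, and those it labeled as benign.
malignant_caught = int(np.sum((y_true == 0) & (y_pred == 0)))
malignant_missed = int(np.sum((y_true == 0) & (y_pred == 1)))

# Share of malignant samples that were detected (recall for class 0).
recall_malignant = malignant_caught / (malignant_caught + malignant_missed)

print("Accuracy:", accuracy)
print("Malignant caught:", malignant_caught, "missed:", malignant_missed)
print("Recall for malignant:", recall_malignant)
```

In a medical setting a missed malignant case is usually the most costly error, so this number is worth checking next to the accuracy.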


# Conclusion

Thanks for your attention. If you have any questions, you can ask them in the comment section and I will answer them as well as I can.

If you see any mistakes or problems in my kernel, please contact me.