# Cihan Yatbaz
###  02 / 11 / 2018



1. [Introduction](#0)
2. [Exploratory Data Analysis (EDA)](#1)
3. [Logistic Regression with Plot](#2)
    1. [Preparing Dataset](#3)
    2. [Creating Parameters](#4)
    3. [Forward and Backward Propagation](#5)
    4. [Updating Parameter](#6)
    5. [Prediction Parameter](#7)
    6. [Logistic Regression](#8)
4. [Logistic Regression with Sklearn](#9)
5. [Conclusion](#10)

<a id="0"></a> <br>
## 1) Introduction

In this kernel we work with the Breast Cancer dataset. We train on 80% of the samples and try to predict the remaining 20%, classifying each tumor as 'benign' or 'malignant'. So let's start.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Load the data
data = pd.read_csv("../input/breast-cancer.csv")
data.head()

In [None]:
data.info()

In [None]:
# Let's drop some columns that we won't use
data.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)  # axis=1 drops the whole column
data.head()

<a id="1"></a> <br>
## 2) Exploratory Data Analysis (EDA)

In [None]:
data.describe()

In [None]:
# Let's select the "mean" columns that we'll use for plotting
data_mean= data[['diagnosis','radius_mean','texture_mean','perimeter_mean','area_mean',
                 'smoothness_mean','compactness_mean','concavity_mean','concave points_mean',
                 'symmetry_mean','fractal_dimension_mean']]

In [None]:
color_list = ['cyan' if i=='M' else 'orange' for i in data_mean.loc[:,'diagnosis']]
pd.plotting.scatter_matrix(data_mean.loc[:, data_mean.columns != 'diagnosis'],
                           c=color_list,
                           figsize= [15,15],
                           diagonal='hist',
                           alpha=0.5,
                           s = 200,
                           marker = '*',
                           edgecolor= "black")
                                        
plt.show()

In [None]:
# Counts of benign ('B') and malignant ('M') samples
sns.countplot(x="diagnosis", data=data)
data.loc[:,'diagnosis'].value_counts()

<a id="2"></a> <br>
## 3) Logistic Regression with Plot
First we prepare the data we will use.

<a id="3"></a> <br>
### A) Preparing Dataset

In [None]:
# Let's convert malignant ("M") to 1 and benign ("B") to 0
data.diagnosis = [ 1 if each == "M" else 0 for each in data.diagnosis]
data.info()

In [None]:
# Let's separate the target (y) from the features (x_data)
y = data.diagnosis.values
x_data = data.drop(["diagnosis"], axis=1)

In [None]:
# Now we normalize the features. Otherwise columns with very large values would dominate the columns with small values.
# Formula: (x - min(x)) / (max(x) - min(x)), applied per column
x = (x_data - x_data.min()) / (x_data.max() - x_data.min())
x.head()
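
As a small illustration of the min-max formula above, here is a sketch on a made-up feature matrix (the array `features` and its values are assumptions for the demo, not taken from the dataset). Each column is scaled independently so that its minimum becomes 0 and its maximum becomes 1.

```python
import numpy as np

# Toy feature matrix: 3 samples (rows) x 2 features (columns); values are made up
features = np.array([[10.0, 0.1],
                     [20.0, 0.3],
                     [30.0, 0.2]])

col_min = features.min(axis=0)   # per-column minimum
col_max = features.max(axis=0)   # per-column maximum

# (x - min(x)) / (max(x) - min(x)), column by column
scaled = (features - col_min) / (col_max - col_min)
print(scaled)
```

After scaling, both columns cover the same range even though the raw values differ by two orders of magnitude.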

In [None]:
# Now we reserve 80% of the values as 'train' and 20% as 'test'.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state=42)

# Transpose so that rows are features and columns are samples: (455,30) -> (30,455)
x_train = x_train.T   
x_test = x_test.T
y_train = y_train.T   
y_test = y_test.T

print("x_train :", x_train.shape)
print("x_test :", x_test.shape)
print("y_train :", y_train.shape)
print("y_test :", y_test.shape)

<a id="4"></a> <br>
### B) Creating Parameters

* The parameters are the weights and the bias.
* Weights: the coefficient of each feature
* Bias: the intercept
* z = (w.T)x + b => z equals (transpose of the weights times input x) + bias
* In other words => z = b + x1*w1 + x2*w2 + ... + x30*w30 (one term per feature; this dataset has 30)
* y_head = sigmoid(z)
* The sigmoid function squashes z into a value between zero and one, which we interpret as a probability.
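
To make the bullets above concrete, here is a minimal sketch that computes z and y_head for one hypothetical sample with 3 features (the real dataset has 30). The names `x_sample`, `w_demo` and `b_demo` are made up for this illustration; the weights use the same constant 0.01 start as this kernel.

```python
import numpy as np

# Hypothetical sample with 3 features, stored as a column: shape (features, samples)
x_sample = np.array([[0.2], [0.5], [0.9]])
w_demo = np.full((3, 1), 0.01)   # constant initial weights
b_demo = 0.0                     # bias (intercept)

# z = (w.T)x + b, i.e. b + x1*w1 + x2*w2 + x3*w3
z_demo = np.dot(w_demo.T, x_sample) + b_demo

# The sigmoid maps z into the open interval (0, 1)
y_head_demo = 1 / (1 + np.exp(-z_demo))

print(z_demo, y_head_demo)
```

With such small weights z is close to 0, so y_head is only slightly above 0.5: the untrained model is essentially undecided.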

In [None]:
# Now let's create the parameter initializer and the sigmoid function. TODO: note the reason (from the video)
def initialize_weights_and_bias(dimension):
    w = np.full((dimension,1),0.01)
    b = 0.0   # It will be float
    return w,b

# Sigmoid Function

# z is computed as:
# z = np.dot(w.T,x_train)+b
def sigmoid(z):
    y_head = 1/(1+np.exp(-z)) # sigmoid formula
    return y_head
sigmoid(0)  # 0 should result in 0.5

<a id="5"></a> <br>
### C) Forward and Backward Propagation


Forward propagation gives us the cost, which measures our error. To reduce it we need the gradients of the cost, which backward propagation computes. So let's implement both.

In [None]:
# In backward propagation we will use the y_head found in forward propagation
# Therefore, instead of writing a separate backward propagation method, let's combine forward and backward propagation

def forward_backward_propagation(w,b,x_train,y_train):
    
    # forward propagation
    z = np.dot(w.T,x_train)+b
    y_head = sigmoid(z)
    loss = -y_train*np.log(y_head)-(1-y_train)*np.log(1-y_head)  
    cost =(np.sum(loss))/x_train.shape[1]         # x_train.shape[1] for scaling
    
    # backward propagation
    derivative_weight = (np.dot(x_train,((y_head-y_train).T)))/x_train.shape[1]
    derivative_bias = np.sum(y_head-y_train)/x_train.shape[1] 
    gradients = {"derivative_weight": derivative_weight,"derivative_bias": derivative_bias}
    
    return cost,gradients
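
One way to gain confidence in the gradient formulas above is a numerical gradient check. The sketch below is not part of the original kernel: it repeats the same cost and gradient formulas in a hypothetical helper `cost_and_grads` (so the block runs on its own) and compares them with central finite differences on random toy data.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_and_grads(w, b, x, y):
    """Same formulas as forward_backward_propagation, repeated so this block runs alone."""
    m = x.shape[1]                                   # number of samples
    y_head = sigmoid(np.dot(w.T, x) + b)
    cost = np.sum(-y * np.log(y_head) - (1 - y) * np.log(1 - y_head)) / m
    dw = np.dot(x, (y_head - y).T) / m
    db = np.sum(y_head - y) / m
    return cost, dw, db

rng = np.random.default_rng(0)
x_toy = rng.random((3, 8))                           # 3 features, 8 samples (made up)
y_toy = (rng.random(8) > 0.5).astype(float)          # random 0/1 labels
w_toy = np.full((3, 1), 0.01)
b_toy = 0.0

_, dw, db = cost_and_grads(w_toy, b_toy, x_toy, y_toy)

# Central differences: (cost(p + eps) - cost(p - eps)) / (2 * eps) for each parameter
eps = 1e-6
numeric_dw = np.zeros_like(w_toy)
for j in range(w_toy.shape[0]):
    w_plus, w_minus = w_toy.copy(), w_toy.copy()
    w_plus[j, 0] += eps
    w_minus[j, 0] -= eps
    numeric_dw[j, 0] = (cost_and_grads(w_plus, b_toy, x_toy, y_toy)[0]
                        - cost_and_grads(w_minus, b_toy, x_toy, y_toy)[0]) / (2 * eps)
numeric_db = (cost_and_grads(w_toy, b_toy + eps, x_toy, y_toy)[0]
              - cost_and_grads(w_toy, b_toy - eps, x_toy, y_toy)[0]) / (2 * eps)

print(np.max(np.abs(dw - numeric_dw)), abs(db - numeric_db))
```

If the analytic and numerical gradients agree to several decimal places, the backward step is very likely implemented correctly.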

<a id="6"></a> <br>
### D) Updating Parameter

In [None]:
# Now let's apply Updating Parameter

def update(w, b, x_train, y_train, learning_rate, number_of_iteration):
    cost_list = []
    cost_list2 = []
    index = []
    # Update (learn) the parameters number_of_iteration times
    for i in range(number_of_iteration):
        # make forward and backward propagation and find cost gradients
        cost,gradients = forward_backward_propagation(w,b,x_train,y_train)
        cost_list.append(cost)
        # lets update
        w = w - learning_rate * gradients["derivative_weight"]
        b = b - learning_rate * gradients["derivative_bias"]
        if i % 10 == 0:
            cost_list2.append(cost)
            index.append(i)
            print ("Cost after iteration %i: %f" %(i, cost))
        
    # Collect the learned weights and bias
    parameters = {"weight":w, "bias":b}
    plt.plot(index, cost_list2)
    plt.xticks(index, rotation='vertical')
    plt.xlabel("Number of iteration")
    plt.ylabel("cost")
    plt.show()
    return parameters, gradients, cost_list
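
The update rule can be tried in isolation on synthetic data. The following is a minimal sketch under stated assumptions: the toy data, the labeling rule and the learning rate of 1.0 are all made up for the demo, and the loop simply inlines the same forward, backward and update steps used above.

```python
import numpy as np

rng = np.random.default_rng(1)
x_toy = rng.random((2, 50))                          # 2 features, 50 samples, values in [0, 1)
y_toy = (x_toy[0] + x_toy[1] > 1.0).astype(float)    # made-up labeling rule

w = np.full((2, 1), 0.01)
b = 0.0
learning_rate = 1.0
costs = []

for i in range(200):
    m = x_toy.shape[1]
    # forward propagation
    y_head = 1 / (1 + np.exp(-(np.dot(w.T, x_toy) + b)))
    cost = np.sum(-y_toy * np.log(y_head) - (1 - y_toy) * np.log(1 - y_head)) / m
    costs.append(cost)
    # backward propagation and parameter update
    w = w - learning_rate * np.dot(x_toy, (y_head - y_toy).T) / m
    b = b - learning_rate * np.sum(y_head - y_toy) / m

print(costs[0], costs[-1])
```

The first cost is close to log(2), which is what an undecided model gives, and the cost after 200 updates is lower. On the real data a suitable learning rate has to be found by experiment.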


<a id="7"></a> <br>
### E) Prediction Parameter

In the prediction step we take x_test as input and run forward propagation on it with the learned parameters.

In [None]:
# Let's create the prediction function
def predict(w,b,x_test):
    # x_test is the input for forward propagation
    z = sigmoid(np.dot(w.T,x_test)+b)   # note: z here already holds the sigmoid output (y_head)
    y_prediction = np.zeros((1,x_test.shape[1]))
    # if z is bigger than 0.5, we predict class one (y_head=1),
    # if z is smaller than or equal to 0.5, we predict class zero (y_head=0),
    for i in range(z.shape[1]):
        if z[0,i]<= 0.5:
            y_prediction[0,i] = 0
        else:
            y_prediction[0,i] = 1

    return y_prediction
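
The loop above can also be written as a single vectorized comparison. This is a sketch of an alternative, not the kernel's own code; `predict_vectorized` and the demo values are hypothetical names and numbers chosen for illustration.

```python
import numpy as np

def predict_vectorized(w, b, x):
    """Hypothetical vectorized variant of predict; applies the same 0.5 threshold."""
    y_head = 1 / (1 + np.exp(-(np.dot(w.T, x) + b)))
    # values <= 0.5 become 0.0, values > 0.5 become 1.0
    return (y_head > 0.5).astype(float)

w_demo = np.array([[2.0], [-1.0]])
b_demo = 0.0
x_demo = np.array([[1.0, 0.0, 0.5],
                   [0.0, 1.0, 1.0]])   # 2 features, 3 samples

pred_demo = predict_vectorized(w_demo, b_demo, x_demo)
print(pred_demo)
```

The third sample has z = 0, so its sigmoid is exactly 0.5 and it falls into class 0, matching the `<= 0.5` branch of the loop version.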


<a id="8"></a> <br>
### F) Logistic Regression

Now lets put them all together.

In [None]:
#Logistic Regression

def logistic_regression(x_train, y_train, x_test, y_test, learning_rate ,  num_iterations):
    # initialize
    dimension =  x_train.shape[0]  # number of features, that is 30
    w,b = initialize_weights_and_bias(dimension)
    # do not change learning rate
    parameters, gradients, cost_list = update(w, b, x_train, y_train, learning_rate,num_iterations)
    
    y_prediction_test = predict(parameters["weight"],parameters["bias"],x_test)

    # Print the test accuracy
    print("test accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_test - y_test)) * 100))
    
logistic_regression(x_train, y_train, x_test, y_test,learning_rate = 0.01, num_iterations = 100)

In [None]:
# We can try to improve the test accuracy by tuning learning_rate and num_iterations
logistic_regression(x_train, y_train, x_test, y_test,learning_rate = 5, num_iterations = 150)
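
The accuracy expression used in `logistic_regression` may look unusual. Because predictions and labels are both 0 or 1, the mean absolute difference is simply the fraction of wrong predictions. The sketch below checks this on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical).

```python
import numpy as np

y_true_demo = np.array([1, 0, 1, 1, 0])
y_pred_demo = np.array([[1.0, 0.0, 0.0, 1.0, 1.0]])   # shape (1, 5), like the output of predict

# Formula used in this kernel: 100 minus the percentage of mismatches
acc_from_abs = 100 - np.mean(np.abs(y_pred_demo - y_true_demo)) * 100

# Equivalent formulation: percentage of exact matches
acc_from_eq = np.mean(y_pred_demo[0] == y_true_demo) * 100

print(acc_from_abs, acc_from_eq)
```

Two of the five demo predictions are wrong, so both expressions give about 60 %.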

<a id="9"></a> <br>
## 4) Logistic Regression with Sklearn
With the Sklearn library, we can obtain the result we found above much more easily.

In [None]:
from sklearn import linear_model
lgrg = linear_model.LogisticRegression(random_state=42, max_iter=150)

print("test accuracy: {} ".format(lgrg.fit(x_train.T, y_train.T).score(x_test.T, y_test.T)))

<a id="10"></a> <br>
> # CONCLUSION
Thank you for your votes and comments.
<br>**If you have any suggestions, please write to me; I will be happy to hear them.**