# Machine Learning Tutorial with the Pima Indians Diabetes Database

<font color ="green">
### Content:
1. [Load and Check Data](#1)
2. [Basic Data Analysis](#2)
3. [Visualisation of the Data](#3)
4. [Machine Learning Algorithms](#4)
    * [Supervised Learning](#5) 
        * [Regression](#6) 
            1. [Linear Regression](#7)
            1. [Polynomial Linear Regression](#8) 
            1. [Decision Tree Regression](#9)
            1. [Random Forest Regression](#10)
            
        * [Classification](#11) 
            1. [Logistic Regression Classification](#12)
            1. [K-Nearest Neighbour (KNN) Classification](#13)
            1. [Support Vector Machine (SVM) Classification](#14)
            1. [Naive Bayes Classification](#15)
            1. [Decision Tree Classification](#16)
            1. [Random Forest Classification](#17)
                
                - [Performance Comparison of Classification Methods](#18)
       
       * [Evaluation of Classification Methods](#19) 
            1. [Confusion Matrix](#19)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

<a id ='1'></a>
## Load and Check Data

In [None]:
#Loading Data
data = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")

<a id ='2'></a>
## Basic Data Analysis

In [None]:
#Let's see the columns
data.columns

In [None]:
# then we need to see info
data.info()

In [None]:
data.head()

An Outcome of 1 means that the patient is sick (has diabetes); 0 means the patient is healthy.


<a id ='3'></a>
## Visualisation of the Data
- pd.plotting.scatter_matrix colours and parameters:
    - green: healthy
    - red: sick
    - c: color
    - figsize: figure size
    - diagonal: histogram of each feature
    - alpha: opacity
    - s: size of marker
    - marker: marker type

In [None]:
#Let's Visualize the Data 
color_list = ["red" if i == 1 else "green" for i in data.loc[:,"Outcome"]]
pd.plotting.scatter_matrix(data.loc[:,data.columns !="Outcome"],
                          c=color_list,
                          figsize = [20,20],
                          diagonal ="hist",
                          alpha = 0.6,
                          s=200,
                          marker ="*",
                          edgecolor = "black")
plt.show()

In [None]:
#To see the distribution of the outcome we'll use seaborn's sns.countplot
import seaborn as sns
sns.countplot(x="Outcome", data = data)
data.loc[:,"Outcome"].value_counts()


As we can see, there are 500 healthy and 268 sick people.
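
Because the two classes are not balanced, it helps to know what accuracy a trivial model would reach before judging the classifiers later on. The sketch below uses only the two counts quoted above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts quoted in the text above
healthy, sick = 500, 268
total = healthy + sick

# A model that always predicts the majority class ("healthy")
# is correct for every healthy sample and wrong for every sick one.
baseline_accuracy = max(healthy, sick) / total
sick_share = sick / total

print("Share of sick samples: {:.3f}".format(sick_share))
print("Majority-class baseline accuracy: {:.3f}".format(baseline_accuracy))
```

Any classifier in the following sections should be compared against this baseline (roughly 65%) rather than against 50%.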

<a id ='4'></a>
## Machine Learning 

<a id ='5'></a>
### ***Supervised Learning Algorithms***

### Train and Test Data

First of all, we need to split our data for training and testing 
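
The cell below also rescales every feature to the range [0, 1] (min-max normalisation) before splitting. As a minimal sketch of that step, here is the same formula on a small made-up NumPy array, with the minimum and maximum taken per column (`axis=0`). With a pandas DataFrame, `x_data.min()` and `x_data.max()` give the same per-column values.

```python
import numpy as np

# Toy data: 3 samples (rows) and 2 features (columns) on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])

# Per-column minimum and maximum
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Min-max normalisation: each column now lies in [0, 1]
# (a constant column would give a division by zero and needs special handling)
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```

Scaling per column matters here because features such as Glucose and DiabetesPedigreeFunction live on very different numeric ranges.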

In [None]:
y = data.Outcome.values
x_data = data.iloc[:,:-1]
# Min-max normalisation per column (feature); DataFrame.min()/max() work column-wise
x = (x_data - x_data.min())/(x_data.max()-x_data.min())

from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.30, random_state = 42) # 30% of the data is assigned to x_test and y_test

x_train = x_train.T
y_train = y_train.T
x_test = x_test.T
y_test = y_test.T

print("x_train shape",x_train.shape)
print("y_train shape",y_train.shape)
print("x_test shape",x_test.shape)
print("y_test shape",y_test.shape)

### Initializing Parameters and Sigmoid Function



In [None]:
#%% Initializing Parameters and Sigmoid Function 

def initialize_weights_and_bias(dimension): # dimension = number of features (8 for this dataset)
    
    w = np.full((dimension,1),0.01) # column vector of shape (dimension,1) with every weight set to 0.01
    b = 0.0 # written as 0.0 so that the bias is a float
    return w,b
# w,b = initialize_weights_and_bias(8)

def sigmoid(z):
    y_head = 1/(1+ np.exp(-z)) # sigmoid formula: maps any real z into (0,1)
    return y_head

# sigmoid(0) should return 0.5
    

### Forward and Backward Propagation
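
For reference, the quantities computed in this step can be written as follows. The notation is an assumption made for this summary: $X$ holds the $m$ training samples as columns, $y$ holds the labels, $w$ is the weight vector, $b$ is the bias and $\hat{y}$ is the vector of predicted probabilities.

$$
z = w^{T} X + b, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \Big[ -y_i \log \hat{y}_i - (1 - y_i) \log (1 - \hat{y}_i) \Big]
$$

$$
\frac{\partial J}{\partial w} = \frac{1}{m} X (\hat{y} - y)^{T}, \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
$$

The first line is the forward pass, the second is the cost, and the third gives the two gradients used in the backward pass.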

In [None]:
#%% Forward - Backward Propagation
# Multiply the training data by the weights, add the bias, and pass the result through the sigmoid function

def forward_backward_propagation(w,b,x_train,y_train):
    #Forward Prop 
    z = np.dot(w.T,x_train) + b # w is transposed so that the matrix product is defined
    y_head = sigmoid(z) # predicted probabilities
    loss = -y_train*np.log(y_head)-(1-y_train)*np.log(1-y_head) # cross-entropy loss of each sample
    cost = (np.sum(loss)) / x_train.shape[1] # average the loss over the number of samples
    # x_train.shape[1] is the number of training samples (537 here)
    

    #Backward Prop
    derivative_weight = (np.dot(x_train,((y_head-y_train).T)))/x_train.shape[1] # gradient of the cost w.r.t. the weights, averaged over the samples
    derivative_bias = np.sum(y_head-y_train)/x_train.shape[1] # gradient of the cost w.r.t. the bias
    gradients = {"derivative_weight": derivative_weight, "derivative_bias": derivative_bias}
    
    return cost,gradients
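
One way to gain confidence in hand-written gradients like these is a finite-difference check. The sketch below is self-contained: it re-implements the same cost and gradient formulas under the hypothetical name `cost_and_gradients`, uses random toy data instead of the diabetes data, and compares the analytic gradients with numerical ones.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradients(w, b, x, y):
    # x has shape (n_features, n_samples); y has shape (n_samples,)
    m = x.shape[1]
    y_head = sigmoid(np.dot(w.T, x) + b)                      # shape (1, m)
    loss = -y * np.log(y_head) - (1 - y) * np.log(1 - y_head)
    cost = np.sum(loss) / m
    dw = np.dot(x, (y_head - y).T) / m                        # shape (n_features, 1)
    db = np.sum(y_head - y) / m
    return cost, dw, db

# Toy data (an assumption for the demo): 3 features, 20 samples
rng = np.random.default_rng(0)
x = rng.random((3, 20))
y = (rng.random(20) > 0.5).astype(float)
w = np.full((3, 1), 0.01)
b = 0.0

_, dw, db = cost_and_gradients(w, b, x, y)

# Central finite differences: perturb one parameter at a time
eps = 1e-6
numeric_dw = np.zeros_like(w)
for j in range(w.shape[0]):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[j, 0] += eps
    w_minus[j, 0] -= eps
    cost_plus = cost_and_gradients(w_plus, b, x, y)[0]
    cost_minus = cost_and_gradients(w_minus, b, x, y)[0]
    numeric_dw[j, 0] = (cost_plus - cost_minus) / (2 * eps)

numeric_db = (cost_and_gradients(w, b + eps, x, y)[0]
              - cost_and_gradients(w, b - eps, x, y)[0]) / (2 * eps)

max_diff = float(np.max(np.abs(dw - numeric_dw)))
print("largest weight-gradient difference:", max_diff)
print("bias-gradient difference:", abs(db - numeric_db))
```

If the analytic formulas are right, both differences should be tiny compared with the gradients themselves.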

### Updating Parameters

In [None]:
#%% Updating Parameters 

def update(w, b, x_train, y_train, learning_rate,number_of_iterations):
    cost_list = []
    cost_list2 = []
    index = []
    
    #Iteration 
    for i in range(number_of_iterations):
        #Doing forward and Backward Propagation 
        cost,gradients = forward_backward_propagation(w,b,x_train,y_train)
        cost_list.append(cost) # store the cost computed before this update (keeps every cost)
        
        #Updating 
        w = w - learning_rate * gradients["derivative_weight"]
        b = b - learning_rate * gradients["derivative_bias"]
    
        if i %10 == 0:
            cost_list2.append(cost) # store the cost every 10 steps
            index.append(i)
            print("Cost after iteration %i: %f"%(i,cost))
            
    # The number of iterations is found by trial; it is enough once the derivatives get close to 0
    # We update (learn) the parameters: weights and bias 
    parameters = {"weight":w , "bias":b}
    plt.plot(index,cost_list2)
    plt.xticks(index,rotation ='vertical')
    plt.xlabel("Number of Iteration")
    plt.ylabel("Cost")
    plt.show()
    return parameters,gradients,cost_list

### Prediction 

In [None]:
#%% Prediction 
def predict(w,b,x_test): # w and b are the learned parameters; x_test is the data whose classes we want to predict
    z = sigmoid(np.dot(w.T,x_test)+b)
    y_prediction = np.zeros((1,x_test.shape[1]))
    
    # If z is greater than 0.5, predict 1 (sick)
    # If z is at most 0.5, predict 0 (healthy)
    
    for i in range(z.shape[1]):
        if z[0,i]<=0.5:
            y_prediction[0,i] = 0
        else:
            y_prediction[0,i] = 1
            
    return y_prediction
        
# Next we compare y_prediction with y_test to measure the accuracy of the training
    

### Logistic Regression 


In [None]:
def logistic_regression(x_train, y_train, x_test, y_test, learning_rate ,  num_iterations):
    # initialize
    dimension =  x_train.shape[0]  # number of features (8 for this dataset)
    w,b = initialize_weights_and_bias(dimension)
    # do not change learning rate
    parameters, gradients, cost_list = update(w, b, x_train, y_train, learning_rate,num_iterations)
    
    y_prediction_test = predict(parameters["weight"],parameters["bias"],x_test)

    # Print test Errors
    print("test accuracy: {} %".format(100 - np.mean(np.abs(y_prediction_test - y_test)) * 100))
    
logistic_regression(x_train, y_train, x_test, y_test,learning_rate = 1, num_iterations = 400) 

## **By Using SKLEARN Library**
- In this section we'll use only the sklearn library for the whole process (regression and classification models)

<a id ='6'></a>
## Regression 
- In this section we'll learn and apply regression methods such as:
    * Linear Regression (y = b0 + b1*x1)
    * Polynomial Linear Regression (y = b0 + b1*x + b2*x^2 + ... + bn*x^n)
    * Decision Tree Regression 
    * Random Forest Regression 
    * EXTRA = Performance Analysis using the R-Square Method

We use regression models to predict continuous values from data.
For example, we could try to predict house prices in California from real California house price data.
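
The R-Square score mentioned in the list above compares the model's squared errors with those of a model that always predicts the mean. Here is a minimal sketch of that calculation. The function name `r2_score_manual` and the toy numbers are illustrative only; sklearn regressors report the same quantity through their `.score(x, y)` method.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # errors of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # errors of "always predict the mean"
    return 1.0 - ss_res / ss_tot

perfect = r2_score_manual([1, 2, 3], [1, 2, 3])      # every prediction is exact
mean_only = r2_score_manual([1, 2, 3], [2, 2, 2])    # always predicts the mean
print(perfect, mean_only)
```

A score near 1 means the model explains most of the variation, a score near 0 means it is no better than predicting the mean, and the score can be negative for models that do worse than that.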

<a id ='7'></a>
### 1-) Linear Regression 
- y=b0 + b1x1

In [None]:
data0 = data[data.Outcome == 0] # Healthy group
data1 = data[data.Outcome == 1] # Sick group
data1.sort_values(by="Age")

In [None]:
#We will use BloodPressure and Age parameters in the Sick group
data1 = data[data.Outcome == 1]
xlin=data1.BloodPressure.values.reshape(-1,1)
ylin=data1.Age.values.reshape(-1,1)

#Linear Regression Model
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
#Evenly spaced BloodPressure values on which to draw the fitted line
predict_space = np.linspace(xlin.min(),xlin.max()).reshape(-1,1)  
#Fit
linear_reg.fit(xlin,ylin)
#Prediction 
predicted = linear_reg.predict(predict_space)
#Performance Analysis w/R^2 Score method
print("R^2 Score is :",linear_reg.score(xlin,ylin))

plt.plot(predict_space, predicted, color="black",linewidth=2)
plt.scatter(x=xlin, y=ylin)
plt.xlabel("Blood Pressure")
plt.ylabel("Age")
plt.show()

- **As you can see, our R^2 score is very low, because the relationship between these two features is not linear enough for Linear Regression to capture**
- **But I wanted to include this plot here as an example** 


<a id ='8'></a>
### 2-) Polynomial Linear Regression
- y = b0 + b1*x + b2*x^2 + ... + bn*x^n 
- This method combines polynomial features with Linear Regression, which is why the solution has 2 steps and uses 2 sklearn tools: first expand x into polynomial terms, then fit a linear model on those terms
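
The two steps can be sketched with plain NumPy on made-up data. `np.vander` builds the columns [1, x, x^2, x^3], which for a single feature is what `PolynomialFeatures(degree = 3)` produces, and a least-squares solve plays the role of the Linear Regression fit. The coefficients below are invented for the demonstration.

```python
import numpy as np

# Made-up data generated from a known cubic, so the fit can be checked
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
true_coeffs = np.array([1.0, -2.0, 0.5, 0.25])   # b0, b1, b2, b3 (illustrative values)

# Step 1: expand x into polynomial terms, one column per power: [1, x, x^2, x^3]
X_poly = np.vander(x, N=4, increasing=True)
y = X_poly @ true_coeffs

# Step 2: ordinary least squares on the expanded features
coeffs, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

print("row for x = 2:", X_poly[2])
print("recovered coefficients:", coeffs)
```

The model stays linear in the coefficients b0..bn even though it is non-linear in x, which is why a linear regressor can fit it.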

In [None]:
#We will use Glucose and Age parameters in the Sick group
data1 = data[data.Outcome == 1]
xpol=data1.Glucose.values.reshape(-1,1) #In Sklearn we need to reshape our data like that
ypol=data1.Age.values.reshape(-1,1)

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
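poly_reg = PolynomialFeatures(degree = 3) #degree = 3 limits the equation to terms up to x^3
x_polynomial = poly_reg.fit_transform(xpol) #Expands xpol into the columns [1, x, x^2, x^3]
#Fit (For fitting we use Linear Regression again..)
linear_reg2 = LinearRegression()  
linear_reg2.fit(x_polynomial,ypol)
#Prediction on evenly spaced Glucose values so that the curve is drawn from left to right
x_grid = np.linspace(xpol.min(),xpol.max(),200).reshape(-1,1)
y_head = linear_reg2.predict(poly_reg.transform(x_grid))
#Visualisation: plot against Glucose itself, not against the expanded feature matrix
plt.scatter(xpol,ypol)
plt.plot(x_grid,y_head,color="red")
plt.xlabel("Glucose")
plt.ylabel("Age")
plt.show()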

<a id ='9'></a>
### 3-) Decision Tree Regression

In [None]:
#We will use BloodPressure and Age parameters in the Sick group
data1 = data[data.Outcome == 1]
xdt=data1.BloodPressure.values.reshape(-1,1)
ydt=data1.Age.values.reshape(-1,1)

from sklearn.tree import DecisionTreeRegressor
dtreg = DecisionTreeRegressor()
#Fit
dtreg.fit(xdt,ydt) 
#Prediction space
xdt_ = np.arange(xdt.min(),xdt.max(),0.01).reshape(-1,1)
y_headdt = dtreg.predict(xdt_)
#Visualisation 
plt.scatter(xdt,ydt,color ="red",label="Values")
plt.plot(xdt_,y_headdt,color="blue",label="Predicted")
plt.legend() #show the labels defined above
plt.show()

<a id ='10'></a>
### 4-) Random Forest Regression 
- Random Forest Regression is an extension of Decision Tree Regression: it trains many Decision Tree Regression models and averages their predictions 


In [None]:
#Let's work with the same features as in the Decision Tree Regression model: BloodPressure and Age 
data1 = data[data.Outcome == 1]
xrf=data1.BloodPressure.values.reshape(-1,1)
yrf=data1.Age.values.reshape(-1,1)

from sklearn.ensemble import RandomForestRegressor  
rfreg = RandomForestRegressor(n_estimators = 100, #The forest averages 100 decision trees
                             random_state= 42)
#Fit
rfreg.fit(xrf,yrf.ravel()) #ravel() passes y as a 1-D array, the shape sklearn expects here
#Prediction space
xrf_ = np.arange(xrf.min(),xrf.max(),0.01).reshape(-1,1)
y_headrf = rfreg.predict(xrf_)
#Visualisation 
plt.scatter(xrf,yrf,color ="red",label="Values")
plt.plot(xrf_,y_headrf,color="blue",label="Predicted")
plt.show()

<a id ='11'></a>
## Classification
- In this section we'll learn and apply classification methods such as:
    + Logistic Regression Classification
    + K-Nearest Neighbour (KNN) Classification
    + Support Vector Machine (SVM) Classification
    + Naive Bayes Classification 
    + Decision Tree Classification 
    + Random Forest Classification
- EXTRA: Evaluation of Classification Methods (an alternative to the score(x_test,y_test) method)
    + Confusion Matrix

In [None]:
y = data.Outcome.values
x_data = data.iloc[:,:-1]
# Min-max normalisation per column (feature); DataFrame.min()/max() work column-wise
x = (x_data - x_data.min())/(x_data.max()-x_data.min())

from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.30, random_state = 42) # 30% of the data is assigned to x_test and y_test


print("x_train shape",x_train.shape)
print("y_train shape",y_train.shape)
print("x_test shape",x_test.shape)
print("y_test shape",y_test.shape)

# For the models below to run without errors, the shapes should be:
# x_train: (537, 8)
# y_train: (537,)
# x_test:  (231, 8)
# y_test:  (231,)

<a id ='12'></a>
### Logistic Regression w/Sklearn Library


In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
print("Test Accuracy : %{}".format(lr.score(x_test,y_test)*100))


### The Logistic Regression model trained above predicts about 74.4% of the test samples correctly

<a id="13"></a>
### K - Nearest Neighbour (KNN) Classification


In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 40) #n_neighbors is a hyperparameter, so we need to search for its optimum value 
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)
# print("Prediction:",prediction) if you want to compare the test data and predictions you can remove # and try 
print("for n={} KNN Score : {}".format(40,knn.score(x_test,y_test))) 


### Alternative 1 - Find the best n_neighbors value

In [None]:
#To see which n_neighbors value is optimal, we can loop over a range of candidates 
score_list = []
for each in range(1,50):
    knn2 = KNeighborsClassifier(n_neighbors = each)
    knn2.fit(x_train, y_train)
    score_list.append(knn2.score(x_test, y_test))

plt.figure(figsize=(8,5))
plt.scatter(range(1,50),score_list)
plt.xlabel("k values")
plt.ylabel("accuracy")
plt.show() 

#n_neighbors = 40 looks like a good choice
#Its test accuracy is 0.753

### Alternative 2 - Find the best n_neighbors value (more informative: compares training and test accuracy)


In [None]:
neig = np.arange(1,50)
train_accuracy = []
test_accuracy = []
#Loop over all k values
for i,k in enumerate(neig):
    knn3 = KNeighborsClassifier(n_neighbors = k)
    #Fit process
    knn3.fit(x_train,y_train)
    #Train Accuracy
    train_accuracy.append(knn3.score(x_train,y_train))
    #Test Accuracy
    test_accuracy.append(knn3.score(x_test,y_test))
    
#Plotting the Values 
plt.figure(figsize =(13,10))
plt.plot(neig, test_accuracy, label = "Testing Accuracy")
plt.plot(neig, train_accuracy, label = "Training Accuracy")
plt.legend()
plt.title("Values vs Accuracy")
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.xticks(neig) #Put one tick on the x axis for every tested number of neighbors
plt.savefig("graph.png")
plt.show()
print("Best Accuracy : {} with K : {}".format(np.max(test_accuracy),1+test_accuracy.index(np.max(test_accuracy))))

<a id="14"></a>
### Support Vector Machine (SVM) Classification 
- SVM looks for the optimum boundary (a line in two dimensions) that separates the objects of the 2 classes 

In [None]:
from sklearn.svm import SVC 
svm = SVC(random_state = 1)
svm.fit(x_train,y_train)

#Test 
print("Accuracy of the SVM Algorithm : ",svm.score(x_test,y_test))

<a id="15"></a>
### Naive Bayes Classification 

In [None]:
from sklearn.naive_bayes import GaussianNB 
nb = GaussianNB()
nb.fit(x_train,y_train)

print("Accuracy of the Naive Bayes Algorithm :",nb.score(x_test,y_test))

<a id="16"></a>
### Decision Tree Classification


In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
print("Accuracy of the Decision Tree :",dt.score(x_test,y_test))

<a id="17"></a>
### Random Forest Classification 
- This method is an advanced version of the Decision Tree Classification method 
- It combines many Decision Trees; their number (n_estimators) is a hyperparameter, so we need to search for the value that gives the best result
- **RF Classification accuracy is usually expected to be better than that of a single Decision Tree**, although this is not guaranteed

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators= 24, random_state=42)
rf.fit(x_train,y_train)
print("Accuracy of the Random Forest Classification : ",rf.score(x_test,y_test))

#To get the best result we need to choose a good value for the n_estimators parameter
#To see which value is optimal, we can loop over a range of candidates 
score_list2 = []
for each in range(1,200):
    rf2 = RandomForestClassifier(n_estimators = each)
    rf2.fit(x_train, y_train)
    score_list2.append(rf2.score(x_test, y_test))

plt.figure(figsize=(8,5))
plt.scatter(range(1,200),score_list2)
plt.xlabel("n values")
plt.ylabel("accuracy")
plt.show() 

In [None]:
aa = np.max(score_list2) #The max value was 0.7878 here, and n = 24 looks like a good option
aa

<a id="18"></a>
## Performance Comparison of Classification Methods
 **Finally, let's check the accuracy results of all the classification methods**

In [None]:
print("Test Accuracy for Logistic Regression: %{}".format(lr.score(x_test,y_test)*100))
print("for n={} KNN Score : %{}".format(40,knn.score(x_test,y_test)*100))
print("Accuracy of the SVM Algorithm : %{}".format(svm.score(x_test,y_test)*100))
print("Accuracy of the Naive Bayes Algorithm : %{}".format(nb.score(x_test,y_test)*100))
print("Accuracy of the Decision Tree : %{}".format(dt.score(x_test,y_test)*100))
print("Accuracy of the Random Forest Classification : %{}".format(rf.score(x_test,y_test)*100))

### As you can see, Random Forest Classification gives the best result, about 2% ahead of its nearest opponent

<a id="19"></a>
## Evaluation of Classification Methods 
  ### Confusion Matrix 
  - In this section we'll learn a way of testing accuracy that is an alternative to .score(x_test,y_test) (**actually a better way to examine the details**)

In [None]:
# Example using the Random Forest Classification model 
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators= 24, random_state=42)
rf.fit(x_train,y_train)
print("Accuracy of the Random Forest Classification : ",rf.score(x_test,y_test))

#The confusion matrix needs the predicted labels for x_test 
y_pred = rf.predict(x_test)
y_true = y_test

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true,y_pred)

#Lets Visualize it 
import seaborn as sns 
f,ax =plt.subplots(figsize=(6,6))
sns.heatmap(cm, annot = True, linecolor = "blue", ax = ax)

plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()


- The first square says that the model predicts value 0 (healthy) as value 0 for about 130 values, shown as 1.3e+02 on the heatmap (correct)
- The second square says that the model predicts value 0 (healthy) as value 1 for 24 values (incorrect)
- The third square says that the model predicts value 1 (sick) as value 0 for 29 values (incorrect)
- The fourth square says that the model predicts value 1 (sick) as value 1 for 51 values (correct)

- So the model predicts the wrong value 29 + 24 = 53 times 
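
The four squares can also be turned into summary metrics. The sketch below uses the three exact counts listed above. The count for the first square is an assumption: it is inferred as 231 - 24 - 29 - 51 = 127 from the test-set size of 231 printed earlier, which is consistent with the rounded 1.3e+02 on the heatmap.

```python
# Counts from the confusion matrix described above.
# tn is inferred from the test-set size (231), not read directly from the plot.
tn, fp, fn, tp = 127, 24, 29, 51
total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all predictions that are correct
precision = tp / (tp + fp)      # of the samples predicted sick, how many are sick
recall = tp / (tp + fn)         # of the truly sick samples, how many are found

print("accuracy : {:.3f}".format(accuracy))
print("precision: {:.3f}".format(precision))
print("recall   : {:.3f}".format(recall))
```

Under these counts the recall is noticeably lower than the accuracy, which is the kind of detail a single .score(x_test,y_test) value hides.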