**CONTENT**

* [Decision Tree for Classification](#1)
* [Decision Tree for Regression](#2)
* [Generalization Error](#3)
    * [Diagnose Bias and Variance Problems](#4)
        * [Cross Validation (CV)](#5)
            * [K-Fold CV](#6)
* [Ensemble Learning](#7)
    * [Voting Classifier](#8)
    * [Bagging (Bootstrap Aggregation) - Classifier and Regressor](#9)
        * [Out Of Bag (OOB) Evaluation](#10)
    * [Random Forests - Classifier and Regressor](#11)
    * [Boosting](#12)
        * [Adaboost (Adaptive Boosting) - Classifier and Regressor](#13)
        * [Gradient Boosting - Classifier and Regressor](#14)
        * [Stochastic Gradient Boosting (SGB) - Classifier and Regressor](#15)
* [Tuning a CART's Hyperparameters](#16)
    * [Grid Search Cross Validation for DecisionTreeClassifier](#17)
    * [Grid Search Cross Validation for RandomForestRegressor](#18)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

<a id="1"></a> <br>
**Decision Tree for Classification**

A decision tree infers the label of each sample in a labeled dataset by asking a sequence of if-else questions about its features.

Think of the features as x-y coordinates: the tree splits this feature space into regions, and every region is assigned a label.

These decision regions are determined by the tree's if-else questions.

Unlike linear models, trees can capture nonlinear relationships between features and labels.

Feature scaling is not required.
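
To make the idea of an if-else question concrete, here is a minimal sketch of how a single threshold on one feature could be chosen. It assumes a CART-style search that minimizes the weighted Gini impurity of the two resulting regions. The helper names `gini` and `best_threshold` and the toy petal lengths are hypothetical, introduced only for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' question."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical petal lengths (cm) and class labels, for illustration only
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9]
species = [2, 2, 2, 1, 1, 1]
threshold, impurity = best_threshold(petal_length, species)
print(threshold, impurity)  # 3.0 0.0: the question "petal_length <= 3.0" separates the classes
```

A full tree repeats this search over every feature and then recurses into each region until a stopping rule such as max_depth is hit.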

In [None]:
df = pd.read_csv('../input/iris-flower-dataset/IRIS.csv')

In [None]:
df.species = [2 if i == 'Iris-setosa' else 1 if i == 'Iris-versicolor' else 0 for i in df.species]

In [None]:
df.head()

In [None]:
X = df.drop('species', axis=1).values  
y = df['species'].values

In [None]:
#Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

#Import train_test_split
from sklearn.model_selection import train_test_split

#Import accuracy_score
from sklearn.metrics import accuracy_score

In [None]:
#Split dataset into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1) 
#stratify=y: train and test sets have the same proportion of class labels as the unsplit dataset

In [None]:
#Instantiate DecisionTree
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

#random_state=1 for reproducibility
#max_depth=2: the tree grows at most 2 levels deep

#criterion='entropy' would select the impurity measure used to choose splits (and hence the decision regions); the default is 'gini'.

In [None]:
#Fit dt to the training set
dt.fit(X_train, y_train)

#Predict test set labels
y_pred = dt.predict(X_test)

#Evaluate test-set accuracy
accuracy_score(y_test, y_pred)

<a id="2"></a> <br>
**Decision Tree for Regression**

In regression, the target variable is continuous.

As in classification, suppose there are two features x1 and x2. The tree repeatedly splits the feature space with if-else questions to form decision regions.
The prediction for a region is the mean of the target values of the training points that fall in it (these values are known because the training set is labeled).
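
The "mean of the region" rule can be sketched with a depth-1 regression tree on a single feature. This is only an illustration: it assumes the split is chosen to minimize the summed squared error of the two leaves, and the helper names and displacement/mpg numbers are made up.

```python
from statistics import mean

def sse(targets):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not targets:
        return 0.0
    m = mean(targets)
    return sum((t - m) ** 2 for t in targets)

def fit_stump(x, y):
    """Find the threshold on x that minimizes the total SSE of the two leaves."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [t for v, t in zip(x, y) if v <= threshold]
        right = [t for v, t in zip(x, y) if v > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict_stump(stump, value):
    """Predict with the mean target of the leaf that value falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if value <= threshold else right_mean

# Hypothetical displacement -> mpg pairs, for illustration only
displacement = [100, 120, 140, 300, 320, 340]
mpg = [30.0, 32.0, 31.0, 15.0, 14.0, 16.0]
stump = fit_stump(displacement, mpg)
print(stump, predict_stump(stump, 110), predict_stump(stump, 310))
```

Every new car with a displacement below the threshold gets the same prediction, which is why a regression tree produces a step-shaped fit.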




In [None]:
df = pd.read_csv('../input/autompg-dataset/auto-mpg.csv')
df.drop(['car name', 'cylinders', 'model year'], axis=1, inplace=True)
df.replace('?','0', inplace=True)
df.horsepower = df.horsepower.astype('float')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
plt.scatter(df.mpg, df.displacement) 

#a nonlinear relationship like the one in this plot cannot be captured well by linear models.

In [None]:
X = df.drop('mpg', axis=1).values  
y = df['mpg'].values

In [None]:
#Import DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor

#Import train_test_split
from sklearn.model_selection import train_test_split

#Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

In [None]:
#Split dataset into 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3) 

In [None]:
#Instantiate DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)

#random_state=3 for reproducibility
#max_depth=4: the tree grows at most 4 levels deep
#min_samples_leaf: a leaf is one of the decision regions.
#--with 0.1, every leaf must contain at least 10% of the training data; a split that would leave fewer samples in a leaf is not made.

In [None]:
#Fit dt to the training set
dt.fit(X_train, y_train)

#Predict test set labels
y_pred = dt.predict(X_test)

#Compute test-set MSE
mse_dt = MSE(y_test, y_pred)

#Compute test-set RMSE
rmse_dt = mse_dt**(1/2)

#Print rmse_dt
print(rmse_dt) #the same metric can be computed for a linear regression model to compare the two results.

<a id="3"></a> <br>
**Generalization Error**

In supervised learning we assume that features and labels are related by a function: y = f(x), where f is unknown.
The goal is to find a model f̂ that approximates f as well as possible.

We do this by minimizing the generalization error.

Generalization Error = Bias² (related to accuracy) + Variance (related to precision) + Irreducible Error (constant)

Bias: how far, on average, the f̂ found by the model is from the true f. High bias -->> underfitting.

       Bias decreases as model complexity increases.

Variance: how inconsistent f̂ is across different training sets. High variance -->> overfitting.

       Variance increases as model complexity increases.

So as bias decreases, variance increases.

There is a balance point between the two, and GE is at its minimum there.

This is called the Bias-Variance Trade-Off. We may not be able to make both as small as possible, but we can find the best compromise.

The goal is to reach the model complexity that avoids both overfitting (f̂ fits the noise in the training set) and underfitting (f̂ is not flexible enough to approximate f).

If the model overfits the training set, its power to predict unseen data is weak.
An overfit model achieves a low error on the training set but gives poor results when applied to the test set.

If the model underfits, the training error and the test error are close to each other, but both are high. In that case the model is too simple (or has learned too little from the data) to find a sufficiently accurate f̂.
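
For squared error, the generalization error decomposition can be written out term by term. This is a sketch under an added assumption, namely that y = f(x) + ε with zero-mean noise ε of variance σ². The symbols f̂, ε and σ² are introduced here for illustration, and the expectation is taken over training sets and noise. Note that the bias term enters squared.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```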


<a id="4"></a> <br>
**Diagnose Bias and Variance Problems**

Once a supervised ML model has been trained, its GE cannot be computed directly.

This is because f is unknown,
       we usually have only one dataset,
       noise is unpredictable.

The solution is as follows:
       
       first split the data into training and test sets,
       fit f̂ on the training set,
       then evaluate the error on the test set.
       The test set error is taken as an approximation of GE.
       However, the test set should not be used until we are confident about the model's performance.
       To assess the model's performance before that, we use Cross Validation (CV).
       
       



<a id="5"></a> <br>
**Cross Validation (CV)**

<a id="6"></a> <br>
**K-Fold CV**

This procedure is carried out on the training set.
     
   Take k=5 as an example:
     
   the training data is split into 5 parts (called folds). The model is trained 5 times, each time on 4 folds and evaluated on the remaining fold, which gives 5 different error values.
   
   The mean of these 5 CV errors is then compared with the training set error.
   
   If the CV error is greater than the training set error, the model has high variance (it overfits).
   
    To fix this, decrease model complexity (decrease max_depth, increase min_samples_leaf...) or gather more data.
    
   If the CV error and the training set error are close to each other but much higher than the desired error, the model has high bias (it underfits).
        
    To fix this, increase model complexity (increase max_depth, decrease min_samples_leaf...) or add more relevant features to the dataset.
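
The fold bookkeeping behind K-Fold CV can be sketched in plain Python. The helper `k_fold_indices` is hypothetical and does no shuffling; it only shows that every sample lands in exactly one validation fold and in the training part of all other folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for train, validation in folds:
    print(len(train), validation)
```

Training the model on each `train` part and scoring it on the matching `validation` part gives the k error values whose mean is the CV error.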
    
   

In [None]:
#Import DecisionTreeRegressor
from sklearn.tree import DecisionTreeRegressor

#Import train_test_split
from sklearn.model_selection import train_test_split

#Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE

#Import cross validation score
from sklearn.model_selection import cross_val_score

In [None]:
#reload df from scratch to avoid confusion with the earlier cells
df = pd.read_csv('../input/autompg-dataset/auto-mpg.csv')
df.drop(['car name', 'cylinders', 'model year'], axis=1, inplace=True)
df.replace('?','0', inplace=True)
df.horsepower = df.horsepower.astype('float')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
X = df.drop('mpg', axis=1).values  
y = df['mpg'].values

In [None]:
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123) 

In [None]:
#Instantiate DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth=4,min_samples_leaf=0.14, random_state=123)

In [None]:
#Evaluate the list of MSE obtained by 10-fold CV
#Set n_jobs to -1 in order to exploit all CPU cores in computation
# scoring='neg_mean_squared_error' returns the negated MSE because scikit-learn scorers follow a higher-is-better convention; the leading minus sign turns it back into MSE
MSE_CV = - cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1) 

In [None]:
#Fit dt to the training set
dt.fit(X_train, y_train)

#Predict the labels of the training set
y_predict_train = dt.predict(X_train)

#Predict the labels of the test set
y_predict_test = dt.predict(X_test)

In [None]:
# CV MSE:
print(MSE_CV.mean())

In [None]:
#Training Set MSE:
print(MSE(y_train, y_predict_train))
print()

#Test Set MSE:
print(MSE(y_test, y_predict_test))

In [None]:
# Training Set Error: 13.65
# Test Set Error    : 22.00
# CV Error          : 16.72
# TrainingSetError < CVError: high variance (overfit); decrease model complexity (decrease max_depth, increase min_samples_leaf...)

<a id="7"></a> <br>
**Ensemble Learning**

Ensemble Learning is a supervised learning technique.

Advantages of CARTs (Classification and Regression Trees):
- Simple to use and to interpret,
- Flexible: able to describe nonlinear dependencies between features and labels,
- No need to standardize or normalize features as a preprocessing step.

Disadvantages of CARTs:
- In classification, for example, they can only produce orthogonal decision boundaries.
- More importantly, CARTs are very sensitive to small variations in the training set:
  removing even a single data point from the training set can change the results drastically.
- CARTs also suffer from high variance when they are trained without constraints, so they may overfit.

The solution that takes advantage of the flexibility of CARTs while reducing their tendency to memorize noise is ENSEMBLE LEARNING.

In summary:
first, different models are trained on the same dataset, and each model makes its own predictions.

A meta-model then aggregates the predictions of the individual models and outputs a final prediction.

The final prediction is more robust and less prone to errors than that of each individual model.

Best results are obtained when the models are skillful in different ways: if some models make predictions that are way off,
the other models should compensate for these errors. In such cases, the meta-model's predictions are more robust.


<a id="8"></a> <br>
**Voting Classifier**

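An ensemble learning technique.

The example here is a binary classification task.

The meta-model makes its prediction by hard voting: the class predicted by the majority of the individual classifiers wins.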
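
Hard voting itself is only a per-sample majority count. A minimal sketch follows, with made-up predictions standing in for the outputs of three fitted classifiers; the function name `hard_vote` is hypothetical.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Majority vote per sample; rows are models, columns are samples."""
    voted = []
    for sample_predictions in zip(*predictions_per_model):
        label, _ = Counter(sample_predictions).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical predictions of three classifiers on four samples
lr_pred = [1, 0, 1, 0]
knn_pred = [1, 1, 0, 0]
dt_pred = [0, 1, 1, 0]
print(hard_vote([lr_pred, knn_pred, dt_pred]))  # [1, 1, 1, 0]
```

With an odd number of voters and two classes there can be no ties, which is one reason three classifiers are a convenient choice here.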



In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

In [None]:
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
df.head()

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
#import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#import models, including VotingClassifier as meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

In [None]:
#Instantiate individual classifiers
lr = LogisticRegression(random_state=SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)

In [None]:
#define a list called classifiers that contains tuples (classifier_name, classifier)
classifiers = [('Logistic Regression', lr),
              ('K Nearest Neighbours', knn),
              ('Classification Tree', dt)]

In [None]:
#we can now write a for loop to iterate over the classifiers
for clf_name, clf in classifiers:
    #fit clf to the training set
    clf.fit(X_train, y_train)
    
    #predict the labels of the test set
    y_pred = clf.predict(X_test)
    
    #evaluate the accuracy of clf on the test set
    print(clf_name, ':', accuracy_score(y_test, y_pred))
    
#lr gave the best result.

In [None]:
#Instantiate a VotingClassifier 
vc = VotingClassifier(estimators=classifiers)

#fit vc to the training set and predict test set labels
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)

#evaluate the test-accuracy score of vc
print('Voting Classifier:', accuracy_score(y_test, y_pred))

#this score is higher than the scores of the individual models above.

<a id="9"></a> <br>
**Bagging (Bootstrap Aggregation) - for Classification and Regression**

A Voting Classifier fits different algorithms on the same training set,

whereas Bagging trains the same algorithm on different subsets of the data.

Bagging has the desirable effect of reducing the variance of the model being ensembled.

The subsets are drawn from the data with replacement (the same data point can be selected more than once). N subsets are created this way and the same algorithm is trained on each of them. Each resulting model makes its own prediction.

The meta-model collects these predictions and makes the final prediction.


For classification,
we use BaggingClassifier,
which aggregates predictions by majority voting.

For regression,
we use BaggingRegressor,
which aggregates predictions by averaging.
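
The two ingredients of bagging, sampling with replacement and aggregating, can be sketched without any library. To keep the example short, a trivial "predict the mean of my sample" model stands in for the decision tree. That stand-in, the helper names and the target values are all assumptions for illustration.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))

def bagged_mean_prediction(targets, n_estimators, seed=1):
    """Fit a trivial mean-predictor on each bootstrap sample, then average them."""
    rng = random.Random(seed)
    predictions = [mean(bootstrap_sample(targets, rng)) for _ in range(n_estimators)]
    return mean(predictions), predictions

# Hypothetical target values, for illustration only
targets = [10.0, 12.0, 11.0, 30.0, 9.0, 13.0]
final_prediction, individual = bagged_mean_prediction(targets, n_estimators=300)
print(final_prediction, min(individual), max(individual))
```

The individual predictions vary from one bootstrap sample to the next, while their average is much more stable. That is the variance reduction bagging is after.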

In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

In [None]:
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
#import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#import models, including BaggingClassifier as meta-model
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

In [None]:
#Instantiate DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16 ,random_state=SEED)

#Instantiate BaggingClassifier
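bc = BaggingClassifier(dt, n_estimators=300, n_jobs=-1, random_state=SEED) #300 trees; n_jobs=-1 so that all CPU cores are used in calculation
#dt is passed positionally because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2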


In [None]:
#fit bc to the training set
bc.fit(X_train, y_train)

#predict test labels
y_pred = bc.predict(X_test)

#evaluate test-set accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Bagging Classifier: {:.3f}'.format(accuracy))

#applying dt on its own, without bagging, reportedly gives an accuracy of about 0.88.
#so bagging improved the performance of dt.

<a id="10"></a> <br>
**Out Of Bag (OOB) Evaluation**

In bagging, a data point can appear more than once in the bootstrap sample drawn for a given model, while other data points are not drawn for that model at all.
On average, each bootstrap sample contains about 63% of the distinct training instances. The remaining 37% or so, which that model never sees, are called its Out-Of-Bag (OOB) instances.

Since a model is not trained on its OOB instances, they can be used to estimate the performance of the ensemble without cross-validation.
This technique is called OOB Evaluation.

For each of the N samples created in bagging there are bootstrap instances (used for training) and OOB instances (left out).
Each of the N models is trained on its bootstrap instances and evaluated on its own OOB instances. This gives N OOB scores, whose
average is the OOB score of the ensemble.
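
The 63% / 37% split follows from sampling with replacement: a given point is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n. A small simulation can check this; the sample size and number of repetitions are arbitrary choices.

```python
import random

def oob_fraction(n_samples, rng):
    """Fraction of indices that a bootstrap sample of size n_samples never draws."""
    drawn = set(rng.choices(range(n_samples), k=n_samples))
    return 1 - len(drawn) / n_samples

rng = random.Random(1)
n_samples = 1000
fractions = [oob_fraction(n_samples, rng) for _ in range(200)]
simulated = sum(fractions) / len(fractions)
expected = (1 - 1 / n_samples) ** n_samples  # probability a given point is never drawn

print(round(simulated, 3), round(expected, 3))
```

Both numbers should be close to 0.37, so roughly a third of the training data is available as a free validation set for each model.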



In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

In [None]:
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
#import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#import models, including BaggingClassifier as meta-model
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

In [None]:
#Instantiate DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16 ,random_state=SEED)

#Instantiate BaggingClassifier
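bc = BaggingClassifier(dt, n_estimators=300, oob_score=True, n_jobs=-1, random_state=SEED) #300 trees; n_jobs=-1 so that all CPU cores are used in calculation
#oob_score=True is needed so that the OOB score is computed during fit
#dt is passed positionally because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2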

Note: the OOB score is
accuracy for classifiers and
the R-squared score for regressors.

In [None]:
#fit bc to the training set
bc.fit(X_train, y_train)

#predict test labels
y_pred = bc.predict(X_test)

In [None]:
#evaluate test-set accuracy
test_accuracy = accuracy_score(y_test, y_pred)

#evaluate OOB accuracy from bc
oob_accuracy = bc.oob_score_

print('Accuracy of Bagging Classifier: {:.3f}'.format(test_accuracy))
print('OOB Accuracy of Bagging Classifier: {:.3f}'.format(oob_accuracy))
#the OOB accuracy gives an estimate of the bagging ensemble's performance without running cross-validation.

<a id="11"></a> <br>
**Random Forests**

An ensemble learning technique.

In bagging we had a base estimator, which was trained on each bootstrap sample.

A random forest can be described as an ensemble method that uses a decision tree as its base estimator.
As in bagging, each tree is trained on a different bootstrap sample of the same size as the training set, drawn with replacement.
In addition, while each tree is being trained, only a limited number (d) of features is sampled at each node without replacement, and the best split is chosen among them.
In scikit-learn, d defaults to the square root of the number of features for RandomForestClassifier, while RandomForestRegressor considers all features by default.
As with the other methods, the predictions of the individual trees are collected by the random forest meta-model, which makes the final prediction.

In classification, the final prediction is made by majority voting, and RandomForestClassifier is used.
In regression, the final prediction is computed by averaging, and RandomForestRegressor is used.

Random forests achieve a lower variance than individual trees.
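
The per-node feature sampling can be sketched as follows, assuming d is the square root of the number of features rounded down. The helper `candidate_features` is hypothetical; a real forest would then search for the best split only among the returned columns.

```python
import math
import random

def candidate_features(n_features, rng):
    """Pick d = floor(sqrt(n_features)) distinct feature indices for one node."""
    d = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), d))

rng = random.Random(1)
n_features = 30  # the breast cancer data used in this notebook has 30 feature columns
for node in range(3):
    print(node, candidate_features(n_features, rng))
```

Because every node sees a different random handful of features, the trees become less correlated with each other than in plain bagging.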


In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

In [None]:
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
#import functions to compute MSE and split data
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split

#import model as meta-model
from sklearn.ensemble import RandomForestRegressor

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

In [None]:
#Instantiate random forest regressor with 400 estimators
rf = RandomForestRegressor(n_estimators=400, min_samples_leaf=0.12, random_state=SEED) #400 regression trees

In [None]:
#fit rf to the training set
rf.fit(X_train, y_train)

#predict test labels
y_pred = rf.predict(X_test)

In [None]:
#Evaluate RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
#we reach a lower value than with a single regression tree.

A trained random forest also lets us draw conclusions about feature importance:

In [None]:
#create a pd.Series of feature importances
importances_rf = pd.Series(rf.feature_importances_, index= df.drop('diagnosis', axis=1).columns) #the index holds the feature names; they are taken from the DataFrame because X is a NumPy array without column names

#sort importances_rf
sorted_importances_rf = importances_rf.sort_values()

#make horizontal plot
sorted_importances_rf.plot(kind='barh', color='lightgreen')
plt.show()

<a id="12"></a> <br>
**Boosting**

Boosting is an ensemble learning method

in which many predictors are trained and each predictor learns from the errors of its predecessor.
In other words, boosting is an ensemble method that combines several weak learners to form a strong learner.

Weak learner: a model doing slightly better than random guessing.
For example, a decision tree with max_depth=1 (called a decision stump) is a weak learner.

In boosting, an ensemble of predictors is trained sequentially, and each predictor tries to correct its predecessor.

Most popular boosting methods: 
1- AdaBoost
2- Gradient Boosting



<a id="13"></a> <br>
**Adaboost (Adaptive Boosting) - Classifier and Regressor**

In AdaBoost, each predictor pays more attention to the instances wrongly predicted by its predecessor. This is achieved by changing the weights of the training instances.

Each predictor is assigned a coefficient alpha that weights its contribution to the final prediction.

Alpha depends on the predictor's training error.

First, predictor1 is trained on the initial dataset (X, y) and its training error is determined.
This error is used to determine alpha1, the coefficient of predictor1.
Alpha1 is then used to determine the weights of the training instances for predictor2.
This is repeated sequentially for all N predictors: each alpha is determined in turn, so that every predictor pays more attention to the instances its predecessor predicted incorrectly.

There is one more parameter: the learning rate.
It is used to shrink each alpha, and there is a trade-off between it and the number of estimators: a smaller learning rate should be compensated by a larger number of estimators.

For classification, AdaBoostClassifier is used; the ensemble's prediction is obtained by weighted majority voting.
For regression, AdaBoostRegressor is used; the ensemble's prediction is obtained by a weighted combination of the predictors' outputs (scikit-learn's implementation uses the weighted median).
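
One reweighting round can be sketched in plain Python. The notes above do not give a formula for alpha, so this sketch assumes one common two-class formulation, alpha = learning_rate * ln((1 - error) / error), and requires 0 < error < 1. The function name and the toy labels are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred, learning_rate=1.0):
    """One reweighting step of discrete AdaBoost for a binary problem."""
    total = sum(weights)
    # Weighted training error of the current predictor
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    # Coefficient of the predictor: larger when the error is smaller
    alpha = learning_rate * math.log((1 - error) / error)
    # Increase the weight of misclassified instances, then renormalize
    new_weights = [w * math.exp(alpha) if t != p else w
                   for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(new_weights)
    return alpha, [w / norm for w in new_weights]

# Hypothetical labels and predictions of predictor1 on five instances
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]   # the third instance is misclassified
weights = [0.2] * 5        # uniform initial weights
alpha1, weights2 = adaboost_round(weights, y_true, y_pred)
print(round(alpha1, 3), [round(w, 3) for w in weights2])
```

After the round, the single misclassified instance carries half of the total weight, so predictor2 is pushed to get it right.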





In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

In [None]:
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
#import functions to compute the ROC AUC score and split data
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

#import model as meta-model
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

In [None]:
#Instantiate DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=1, random_state=SEED)

In [None]:
#Instantiate AdaBoostClassifier
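adb_clf = AdaBoostClassifier(dt, n_estimators=100, random_state=SEED)
#dt is passed positionally because the keyword was renamed from base_estimator to estimator in scikit-learn 1.2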

In [None]:
#fit adb_clf to the training set
adb_clf.fit(X_train, y_train)

#predict test set probabilities of positive class
y_pred_proba = adb_clf.predict_proba(X_test)[:,1]

In [None]:
#evaluate test set roc auc score
adb_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba) #roc_auc_score needs the predicted probabilities, not the predicted labels

In [None]:
print('Test set roc auc score of adb_clf: {:.2f}'.format(adb_clf_roc_auc_score))

<a id="14"></a> <br>
**Gradient Boosting**

Gradient boosting has a proven track record of winning many machine learning competitions.

In gradient boosting,
the errors of the predecessor are corrected sequentially.
In contrast to AdaBoost, the weights of the training instances are not tweaked.
Instead, each predictor is trained using the residual errors of its predecessor as labels.

tree1 is trained using the features matrix X and the dataset labels y.
The predictions y1 are used to determine the training set residual errors r1.
r1 consists of the differences between the true values and y1.
Then,
tree2 is trained using the features matrix X and the residual errors r1 of tree1 as labels.
This continues until N trees have been trained.

An important parameter in gradient boosting is shrinkage.
Shrinkage: the prediction of each tree in the ensemble is shrunk by multiplying it by a learning rate.

Again, there is a trade-off between this parameter and the number of estimators.

For classification, GradientBoostingClassifier is used.
For regression, GradientBoostingRegressor is used.
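
The residual-fitting loop can be sketched with depth-1 trees on a single feature. This is an illustration under assumptions, not the library's implementation: it uses squared error, starts from the mean of y rather than from a first tree, and all helper names and data are made up.

```python
from statistics import mean

def fit_stump(x, residuals):
    """Best single-threshold regressor on one feature (minimizes squared error)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for v, r in zip(x, residuals) if v <= t]
        right = [r for v, r in zip(x, residuals) if v > t]
        lm, rm = mean(left), mean(right)
        cost = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    return best[1:]

def predict_stump(stump, v):
    t, lm, rm = stump
    return lm if v <= t else rm

def gradient_boost(x, y, n_trees, learning_rate):
    """Fit trees sequentially on the residuals left by the current ensemble."""
    base = mean(y)
    prediction = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrinkage: add only a fraction of the new tree's prediction
        prediction = [p + learning_rate * predict_stump(stump, v)
                      for p, v in zip(prediction, x)]
        mse_history.append(mean((t - p) ** 2 for t, p in zip(y, prediction)))
    return base, stumps, mse_history

# Hypothetical one-feature regression data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
base, stumps, mse_history = gradient_boost(x, y, n_trees=20, learning_rate=0.5)
print([round(m, 3) for m in mse_history[:5]])
```

The training MSE shrinks with every added tree. A smaller learning_rate makes each step more cautious, which is why it usually has to be paired with a larger number of trees.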

In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

In [None]:
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
#import functions to compute MSE and split data
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split

#import model as meta-model
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

In [None]:
#Instantiate GradientBoostingClassifier
gbt = GradientBoostingClassifier(n_estimators=300, max_depth=1, random_state=SEED)

In [None]:
#fit gbt to the training set
gbt.fit(X_train, y_train)

#predict the test set labels
y_pred = gbt.predict(X_test)

In [None]:
#evaluate test set RMSE (with 0/1 labels this is the square root of the misclassification rate)
rmse_test = MSE(y_test, y_pred)**(1/2)

print('Test set RMSE of gbt: {:.2f}'.format(rmse_test))

<a id="15"></a> <br>
**Stochastic Gradient Boosting (SGB)**

Gradient boosting involves an exhaustive search procedure.
Each tree in the ensemble is trained to find the best split-points and the best features.
This procedure may lead to CARTs that use the same split-points and possibly the same features.

To mitigate this, you can use Stochastic Gradient Boosting (SGB).
In Stochastic Gradient Boosting, each CART is trained on a random subset of the training data.
This subset is sampled without replacement.
At the level of each node, features are also sampled without replacement when choosing the best split-points.
As a result, this creates further diversity in the ensemble, and the net effect is adding more variance to the ensemble of trees.

Instead of training on the full X and y, each tree is trained on a sample of them. Predictions are made and the residual errors are computed.
These residual errors are multiplied by the learning rate and then fed to the next tree in the ensemble.
This procedure is repeated until all the trees have been trained.
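
The two sampling steps can be sketched in isolation. The helper `sgb_draw` is hypothetical, and so is the assumption that fractional settings are turned into counts by rounding down with a minimum of one. The 398 rows and 30 columns are illustrative sizes.

```python
import random

def sgb_draw(n_samples, n_features, subsample, max_features, rng):
    """Row indices for one tree and feature indices for one node, both without replacement."""
    n_rows = max(1, int(subsample * n_samples))
    n_cols = max(1, int(max_features * n_features))
    rows = sorted(rng.sample(range(n_samples), n_rows))
    cols = sorted(rng.sample(range(n_features), n_cols))
    return rows, cols

rng = random.Random(1)
rows, cols = sgb_draw(n_samples=398, n_features=30, subsample=0.8, max_features=0.2, rng=rng)
print(len(rows), len(cols))  # 318 6
```

Each tree would be fitted on its own `rows`, and each node would search for its split only among its own `cols`.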




In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
#import functions to compute MSE and split data
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split

#import model as meta-model
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

In [None]:
#Instantiate a stochastic GradientBoostingRegressor
sgbt = GradientBoostingRegressor(max_depth=1, subsample=0.8, max_features=0.2, n_estimators=300, random_state=SEED)
#subsample=0.8 -> each tree to sample 80% of the data for training
#max_features=0.2 -> each tree uses 20% of available features to perform the best-split

In [None]:
#fit sgbt to the training set
sgbt.fit(X_train, y_train)

#predict the test set labels
y_pred = sgbt.predict(X_test)

In [None]:
#evaluate test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)

print('Test set RMSE of sgbt: {:.2f}'.format(rmse_test))

<a id="16"></a> <br>
**Tuning a CART's Hyperparameters**

Parameters: are learned from data through training.
example: split-point of a node, split feature of a node in a CART

Hyperparameters: are not learned from data. They should be set prior to training.
example: max_depth, min_samples_leaf in a CART

Hyperparameter tuning consists of searching for the set of optimal hyperparameters for the learning algorithm.

There are many tuning methods, but here we only cover grid search.

<a id="17"></a> <br>
**Grid Search Cross Validation for DecisionTreeClassifier**

First, you manually set a grid of hyperparameter values.

Then, you pick a metric for scoring model performance and you search exhaustively through the grid.

For each set of hyperparameters, you evaluate each model's CV score.

The optimal hyperparameters are those of the model achieving the best CV score.
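
"Searching exhaustively" means trying every combination in the grid. The sketch below only enumerates the candidates for the grid used in this section and counts how many models a 10-fold search has to fit. The variable names `names`, `combinations` and `n_folds` are illustrative.

```python
from itertools import product

# Same grid as params_dt in this notebook
params_dt = {'max_depth': [3, 4, 5, 6],
             'min_samples_leaf': [0.04, 0.06, 0.08],
             'max_features': [0.2, 0.4, 0.6, 0.8]}

names = list(params_dt)
combinations = [dict(zip(names, values)) for values in product(*params_dt.values())]

n_folds = 10
print(len(combinations), 'candidates,', len(combinations) * n_folds, 'model fits')
print(combinations[0])
```

The cost grows multiplicatively with every hyperparameter added to the grid, which is why grids are usually kept small.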


In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
from sklearn.model_selection import train_test_split
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

In [None]:
#import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
#Instantiate a DecisionTreeClassifier dt
dt = DecisionTreeClassifier(random_state=SEED)

In [None]:
#print out dt's hyperparameters:
print(dt.get_params())
#here we only optimize max_depth, max_features and min_samples_leaf.
#max_features: number of features to consider when looking for the best split


In [None]:
#Import GridSearchCV 
from sklearn.model_selection import GridSearchCV

#define the grid of hyperparameters 
params_dt = {'max_depth': [3,4,5,6],
             'min_samples_leaf': [0.04, 0.06, 0.08],
             'max_features': [0.2, 0.4, 0.6, 0.8]}

#Instantiate a 10-fold CV grid search object 
grid_dt = GridSearchCV(estimator=dt, 
                       param_grid=params_dt,
                       scoring='accuracy',
                       cv=10,
                       n_jobs=-1)

#fit grid_dt to the training set
grid_dt.fit(X_train, y_train)

#extract best hyperparameters from 'grid_dt'
best_hyperparams = grid_dt.best_params_
print('best hyperparameters:', best_hyperparams)

#extract best CV score from grid_dt
best_CV_score = grid_dt.best_score_
print('best CV score:', best_CV_score)

#extract best model from grid_dt
best_model = grid_dt.best_estimator_
print('best model:', best_model)

#evaluate test set accuracy
test_acc = best_model.score(X_test, y_test)
print('test set accuracy of best model:', test_acc)


<a id="18"></a> <br>
**Grid Search Cross Validation for RandomForestRegressor**

In addition to the CART hyperparameters, a random forest has others that can be tuned, such as the number of estimators and bootstrap (True or False).



In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
df.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)
df.diagnosis = [1 if i == 'M' else 0 for i in df.diagnosis]

In [None]:
X = df.drop('diagnosis', axis=1).values  
y = df['diagnosis'].values

In [None]:
#Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

In [None]:
#set seed for reproducibility
SEED = 1

In [None]:
from sklearn.model_selection import train_test_split
#Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)

In [None]:
#Instantiate RandomForestRegressor
rf = RandomForestRegressor(random_state=SEED)

In [None]:
#Inspect rf's hyperparameters
rf.get_params()
#we will optimize n_estimators, max_depth, min_samples_leaf, max_features

In [None]:
#Import GridSearchCV and metric MSE
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error as MSE

#define the grid of hyperparameters 
params_rf = { 'n_estimators': [300, 400, 500],
              'max_depth': [4,6,8],
              'min_samples_leaf': [0.1, 0.2],
              'max_features': ['log2', 'sqrt']}

#Instantiate a 3-fold CV grid search object 
grid_rf = GridSearchCV(estimator=rf, 
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error', #negative mse
                       cv=3,
                       verbose=1, #verbose controls how much progress information is printed during the search
                       n_jobs=-1)

#fit grid_rf to the training set
grid_rf.fit(X_train, y_train)

In [None]:
#extract best hyperparameters
best_hyperparams = grid_rf.best_params_
print('best parameters: \n',best_hyperparams)

#extract best model 
best_model = grid_rf.best_estimator_
print('\nbest model: \n',best_model)

In [None]:
#predict the test set labels
y_pred = best_model.predict(X_test)

#Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)
print('RMSE of test set: {:.2f}'.format(rmse_test))