<a href="https://colab.research.google.com/github/stefaniemeliss/IADS_SC_2022_DT/blob/main/GradientBoostingClassifier_IADS_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Gradient Boosting for Classification**
## 1. IRIS Data
## 2. Mushroom Classification Data (Kaggle)






In [24]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 1. IRIS DATA


## Dataset for Classification




> **Dataset:**  [Iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris).



*   **Number of Instances:** 
    *   150 (50 in each of three classes)
*   **Number of Attributes:**
    *   4 numeric, predictive attributes and the class

*   **Attribute Information:**
    *   sepal length in cm
    *   sepal width in cm
    *   petal length in cm
    *   petal width in cm

*   **Classes:**
    *   Setosa (0)
    *   Versicolour (1)
    *   Virginica (2)
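
The figures in the list above can be checked directly against the loaded dataset. This is a minimal sketch; the variable name `class_counts` is introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many samples carry each integer label (0, 1, 2).
class_counts = np.bincount(iris.target)

for label, (name, count) in enumerate(zip(iris.target_names, class_counts)):
    print(f"{label}: {name} ({count} samples)")

print("Feature matrix shape:", iris.data.shape)
```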
    






In [25]:
# Import libraries
from sklearn import datasets  # DATA
from sklearn.model_selection import train_test_split # to split train-test data
from sklearn import ensemble # To get Gradient Boosting classifier 
from sklearn import metrics # To generate evaluation metrics
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score


from sklearn.tree import export_graphviz # exporting the tree structure as dot file
from pydotplus.graphviz import graph_from_dot_data # export png image from dot file
from IPython.display import Image, SVG # Show the image within colab notebook
from graphviz import Source
import matplotlib.pyplot as plt


import pandas as pd # for basic data manipulations 
import numpy as np
import warnings
warnings.filterwarnings('ignore')


### 1. Load Data

In [26]:
#load data and see meta info
iris = datasets.load_iris()
dir(iris)

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

### 2. Explore Data


In [27]:
# print type and shape of data
print(type(iris.data))
print(type(iris.target))

print(iris.data.shape)
print(iris.target.shape)

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(150, 4)
(150,)


### 3. Create a pandas DataFrame and manipulate the data

In [28]:
dfCls = pd.DataFrame(iris.data, columns=iris.feature_names)
# dfCls.head()

In [29]:
# Add target data to the pandas DataFrame
dfCls['target'] = iris.target
# dfCls.head()

### 4. Split the data for Training and Testing

In [30]:
X_train, X_test, y_train, y_test = train_test_split(dfCls.drop(['target'],axis='columns'), iris.target, test_size=0.2,random_state=0, stratify=iris.target)
print(X_train.shape)
print(X_test.shape)

(120, 4)
(30, 4)
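
Because the split uses `stratify=iris.target`, each class should keep its share in both subsets. The sketch below repeats the split on the raw arrays (rather than on `dfCls`) and counts the labels; the names `X_tr`, `X_te`, `y_tr` and `y_te` are chosen here to avoid clashing with the notebook's variables.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Same split settings as in the cell above, applied to the raw NumPy arrays.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

train_counts = np.bincount(y_tr)  # samples per class in the training set
test_counts = np.bincount(y_te)   # samples per class in the test set

print("train:", train_counts)
print("test: ", test_counts)
```

With 50 samples per class and an 80/20 split, stratification leaves 40 samples of each class for training and 10 of each for testing, which matches the `support` columns in the reports further down.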


### 5. Initialise a Gradient Boosting Classifier

In [31]:
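# Note: loss='deviance' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3; use loss='log_loss' on newer versions.
gbClassifier = ensemble.GradientBoostingClassifier(criterion='friedman_mse', init=None,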
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=500,
                           n_iter_no_change=None,
                           random_state=0, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)



> ***Let's dig into*** **[ensemble.GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier)**
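
One way to get a feel for `n_estimators` is to watch the accuracy after each boosting stage with `staged_predict`. This is only a sketch: it assumes 50 stages instead of the 500 used above so that it runs quickly, and the held-out split is reused purely for illustration.

```python
from sklearn import datasets, ensemble, metrics
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

n_stages = 50  # assumption: fewer stages than the notebook's 500, for speed
model = ensemble.GradientBoostingClassifier(
    n_estimators=n_stages, learning_rate=0.1, max_depth=3, random_state=0
)
model.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_stages trees.
staged_accuracy = [
    metrics.accuracy_score(y_te, y_stage) for y_stage in model.staged_predict(X_te)
]

for stage in (1, 10, 25, 50):
    print(f"after {stage:2d} trees: accuracy = {staged_accuracy[stage - 1]:.3f}")
```

If the curve flattens early, a smaller `n_estimators` (or a lower `learning_rate` with more trees) may be worth trying.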




### 6. Model Evaluation on Train data

In [32]:
#perform 10-fold cross-validation and print the confusion matrix (CM)
CV_predicted = cross_val_predict(gbClassifier, X_train, y_train, cv=10) #CV predicted values (training data)
CV_score = cross_val_score(gbClassifier, X_train, y_train, cv=10) #CV model score (training data)

print("Cross validation Score on train data: ",CV_score.mean())
print("\n")

print("Confusion matrix on CV predictions (train data)")
print(metrics.confusion_matrix(y_train, CV_predicted)) # confusion matrix on CV predictions (train data)
print("\n")

print("Classification report CV predictions (train data)")
print(metrics.classification_report(y_train, CV_predicted, target_names=['Setosa', 'Versicolor', 'Virginica'])) # classification report CV predictions (train data)


Cross validation Score on train data:  0.9499999999999998


Confusion matrix on CV predictions (train data)
[[40  0  0]
 [ 0 37  3]
 [ 0  3 37]]


Classification report CV predictions (train data)
              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        40
  Versicolor       0.93      0.93      0.93        40
   Virginica       0.93      0.93      0.93        40

    accuracy                           0.95       120
   macro avg       0.95      0.95      0.95       120
weighted avg       0.95      0.95      0.95       120
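
The numbers in the classification report follow directly from the confusion matrix. As a worked example, the sketch below copies the matrix printed above and derives recall, precision and accuracy by hand.

```python
import numpy as np

# Confusion matrix copied from the CV output above (rows = true class, columns = predicted class).
cm = np.array([
    [40, 0, 0],
    [0, 37, 3],
    [0, 3, 37],
])

recall = np.diag(cm) / cm.sum(axis=1)     # correct predictions / true samples per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / predicted samples per class
accuracy = np.trace(cm) / cm.sum()        # all correct predictions / all samples

print("recall:   ", recall)
print("precision:", precision)
print("accuracy: ", accuracy)
```

For Versicolor and Virginica both ratios are 37/40 = 0.925, which the report displays as 0.93, and the accuracy is 114/120 = 0.95.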



### 7. Let's fit the GB model on Training data and perform prediction with the Test data 

In [33]:
gbClassMdl = gbClassifier.fit(X_train,y_train)

y_predicted = gbClassMdl.predict(X_test)

### 8. Model Evaluation on Test Data

In [34]:
mdl_score = gbClassMdl.score(X_test,y_test) #model score (test data)
print ("Model Score on test data:",mdl_score)
print("\n")

print("Confusion matrix (test data)")
print(metrics.confusion_matrix(y_test, y_predicted)) #confusion matrix (test data)
print("\n")

print("Classification report (test data)")
print(metrics.classification_report(y_test, y_predicted, target_names=['Setosa', 'Versicolor', 'Virginica'])) # classification report (test data)

Model Score on test data: 0.9666666666666667


Confusion matrix (test data)
[[10  0  0]
 [ 0 10  0]
 [ 0  1  9]]


Classification report (test data)
              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        10
  Versicolor       0.91      1.00      0.95        10
   Virginica       1.00      0.90      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
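
To see which test samples are behind the off-diagonal entries of a confusion matrix, the true and predicted labels can be compared element by element. The sketch below assumes default hyperparameters, so it may not reproduce exactly the single Virginica error shown above; `wrong_idx` is a name introduced for this example.

```python
import numpy as np
from sklearn import datasets, ensemble, metrics
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

# Assumption: default hyperparameters, not the exact configuration used in the notebook.
model = ensemble.GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Positions in the test set where the prediction and the true label disagree.
wrong_idx = np.flatnonzero(y_pred != y_te)
for i in wrong_idx:
    print(f"test row {i}: true={iris.target_names[y_te[i]]}, "
          f"predicted={iris.target_names[y_pred[i]]}, features={X_te[i]}")

cm = metrics.confusion_matrix(y_te, y_pred)
print("misclassified samples:", len(wrong_idx))
```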



# 2. Mushroom Classification Data (Kaggle)
[See further details](https://www.kaggle.com/uciml/mushroom-classification)



### 1. Load Data

In [35]:
#load data from the mounted Google Drive
mushData = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/mushrooms.csv')

### 2. Explore Data

In [36]:
#print first five rows of the data
mushData.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [37]:
#print size of the data
mushData.shape

(8124, 23)

In [38]:
#print summary statistics for each column
mushData.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148
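
The `unique` row shows that `veil-type` takes a single value across all 8124 rows, so it cannot help a classifier separate the classes. The notebook keeps the column; as an optional step, constant columns could be detected and dropped along the following lines. The `toy` frame is hypothetical stand-in data, not the mushroom file.

```python
import pandas as pd

# Hypothetical miniature frame with one constant column, mimicking `veil-type`.
toy = pd.DataFrame({
    "veil-type": ["p", "p", "p", "p"],
    "odor": ["p", "a", "l", "n"],
    "bruises": ["t", "t", "f", "f"],
})

# A column with exactly one distinct value carries no information.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
reduced = toy.drop(columns=constant_cols)

print("constant columns:", constant_cols)
print("remaining columns:", list(reduced.columns))
```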


In [39]:
#print key information about the data
mushData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [40]:
#Check the class balance
mushData['class'].value_counts()

e    4208
p    3916
Name: class, dtype: int64
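
Expressed as proportions, the counts above show that the two classes are close to balanced. The sketch below recomputes the shares from the printed counts; on the real frame, `mushData['class'].value_counts(normalize=True)` would give the same result.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
class_counts = pd.Series({"e": 4208, "p": 3916}, name="class")

# Share of each class in the full dataset.
class_share = class_counts / class_counts.sum()
print(class_share.round(3))
```

Roughly 52% of the rows are labelled `e` and 48% are labelled `p`, so plain accuracy is a reasonable metric here.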

### 3. Perform data manipulations

In [41]:
from sklearn.preprocessing import LabelEncoder 
labelencoder=LabelEncoder()
# Transform categorical data to numerical data. Caveat: LabelEncoder assigns integer codes in
# alphabetical order (e.g. class 'e' (edible) -> 0, 'p' (poisonous) -> 1), so nominal categories end up looking ordinal.
for col in mushData.columns:
    mushData[col] = labelencoder.fit_transform(mushData[col])
 
mushData.head() #print first five rows of the data

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
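
The caveat about `LabelEncoder` can be illustrated on a tiny example. The sketch below uses a hypothetical `toy` frame to show how labels map to integers in alphabetical order, and contrasts this with one-hot encoding via `pd.get_dummies`, an alternative the notebook does not use and which avoids implying an order between categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in data with the same kind of single-letter codes.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
})

# LabelEncoder sorts the distinct labels, so 'e' becomes 0 and 'p' becomes 1.
encoder = LabelEncoder()
class_codes = encoder.fit_transform(toy["class"])
print(list(encoder.classes_), class_codes.tolist())

# One-hot encoding creates one indicator column per category instead of a single ordered code.
odor_onehot = pd.get_dummies(toy["odor"], prefix="odor")
print(list(odor_onehot.columns))
```

Tree-based models often cope well with integer codes, which may be why the simpler encoding is sufficient in this notebook.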


In [42]:
mushData.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,...,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0,8124.0
mean,0.482029,3.348104,1.827671,4.504677,0.415559,4.144756,0.974151,0.161497,0.309207,4.810684,...,1.603644,5.816347,5.794682,0.0,1.965534,1.069424,2.291974,3.59675,3.644018,1.508616
std,0.499708,1.604329,1.229873,2.545821,0.492848,2.103729,0.158695,0.368011,0.462195,3.540359,...,0.675974,1.901747,1.907291,0.0,0.242669,0.271064,1.801672,2.382663,1.252082,1.719975
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,3.0,0.0,2.0,1.0,0.0,0.0,2.0,...,1.0,6.0,6.0,0.0,2.0,1.0,0.0,2.0,3.0,0.0
50%,0.0,3.0,2.0,4.0,0.0,5.0,1.0,0.0,0.0,5.0,...,2.0,7.0,7.0,0.0,2.0,1.0,2.0,3.0,4.0,1.0
75%,1.0,5.0,3.0,8.0,1.0,5.0,1.0,0.0,1.0,7.0,...,2.0,7.0,7.0,0.0,2.0,1.0,4.0,7.0,4.0,2.0
max,1.0,5.0,3.0,9.0,1.0,8.0,1.0,1.0,1.0,11.0,...,3.0,8.0,8.0,0.0,3.0,2.0,4.0,8.0,5.0,6.0


In [43]:
target = mushData['class'] #get the labels as the target (a pandas Series)
np.array(target, dtype=pd.Series) #preview the labels as a NumPy array (display only; `target` itself is unchanged)

array([1, 0, 0, ..., 0, 1, 0], dtype=object)

### 4. Split the data for Training and Testing

In [44]:
X_train, X_test, y_train, y_test = train_test_split(mushData.drop(['class'],axis='columns'), target, test_size=0.2,random_state=123, stratify=target)
print(X_train.shape)
print(X_test.shape)

(6499, 22)
(1625, 22)


### 5. Perform grid search to find the best hyperparameters

In [45]:
from sklearn.model_selection import GridSearchCV # get gridsearch with cross validation

In [46]:
#provide GB hyperparameters
gb_hyperparameters = {
    "n_estimators": [50, 100],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [1, 3, 5]
}

nfolds = 10 #number of folds for CV
gbClassifier = ensemble.GradientBoostingClassifier(random_state=123) #initialise GB classifier

# create Grid search object
gs_gb_clf = GridSearchCV(gbClassifier, gb_hyperparameters, 
                          n_jobs=-1, cv=nfolds,
                          scoring='accuracy')
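
It can help to estimate how much work a grid search will do before fitting it. The sketch below counts the hyperparameter combinations with `ParameterGrid` and multiplies by the number of folds; `n_candidates` and `n_fits` are names introduced for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid and fold count as in the cell above.
gb_hyperparameters = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [1, 3, 5],
}
nfolds = 10

n_candidates = len(ParameterGrid(gb_hyperparameters))  # 2 * 3 * 3 combinations
n_fits = n_candidates * nfolds                          # one fit per combination and fold

print(f"{n_candidates} candidates x {nfolds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training data.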





In [47]:
gs_gb_clf.fit(X_train, y_train) #fit the grid search object

GridSearchCV(cv=10, estimator=GradientBoostingClassifier(random_state=123),
             n_jobs=-1,
             param_grid={'learning_rate': [0.05, 0.1, 0.2],
                         'max_depth': [1, 3, 5], 'n_estimators': [50, 100]},
             scoring='accuracy')

In [48]:
print(gs_gb_clf.best_score_)
print(gs_gb_clf.best_params_)

1.0
{'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 100}
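
Beyond `best_score_` and `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`. The sketch below shows one way to tabulate it; it assumes a much smaller grid on the Iris data so that it runs in seconds, and on the notebook's object the same lines would apply to `gs_gb_clf.cv_results_`.

```python
import pandas as pd
from sklearn import datasets, ensemble
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Assumption: a deliberately small grid and 3 folds, chosen only to keep the example fast.
small_grid = {"n_estimators": [5, 10], "max_depth": [1, 2]}
search = GridSearchCV(
    ensemble.GradientBoostingClassifier(random_state=123),
    small_grid, cv=3, scoring="accuracy",
)
search.fit(X, y)

# One row per candidate, ordered from best to worst mean CV accuracy.
results = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results.to_string(index=False))
```

When several candidates share the top score, as can happen when the best score is 1.0, this table shows which other settings performed equally well.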


In [49]:
best_parameters_gs = gs_gb_clf.best_params_ #get the best parameters based on the 10-fold CV grid search

### 6. Initialise a Gradient Boosting Classifier

In [50]:
gbClassifier_best = ensemble.GradientBoostingClassifier(**best_parameters_gs, random_state=123) #initialise GB classifier with best set of parameters
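
Re-initialising the classifier with the best parameters is one option. Since `GridSearchCV` uses `refit=True` by default, the search object also exposes an already fitted model as `best_estimator_`. The sketch below compares the two routes on a small, hypothetical grid over the Iris data.

```python
import numpy as np
from sklearn import datasets, ensemble
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Assumption: a tiny grid and 3 folds, used only to keep the example quick.
small_grid = {"n_estimators": [5, 10], "max_depth": [1, 2]}
search = GridSearchCV(
    ensemble.GradientBoostingClassifier(random_state=123),
    small_grid, cv=3, scoring="accuracy",
)
search.fit(X, y)

# Route 1: the model GridSearchCV refitted on all of X, y with the best parameters.
refit_model = search.best_estimator_

# Route 2: a fresh classifier built from best_params_, as done in the cell above.
manual_model = ensemble.GradientBoostingClassifier(
    **search.best_params_, random_state=123
).fit(X, y)

same_predictions = np.array_equal(refit_model.predict(X), manual_model.predict(X))
print("identical predictions:", same_predictions)
```

Building the classifier by hand, as the notebook does, keeps the unfitted estimator available for the separate cross-validation step that follows.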


### 7. Model Evaluation on Train data

In [51]:
#perform 10-fold cross-validation and print the confusion matrix (CM)
CV_predicted = cross_val_predict(gbClassifier_best, X_train, y_train, cv=10) #CV predicted values (training data)
CV_score = cross_val_score(gbClassifier_best, X_train, y_train, cv=10) #CV model score (training data)

print("Cross validation Score on train data: ",CV_score.mean())
print("\n")

print("Confusion matrix on CV predictions (train data)")
print(metrics.confusion_matrix(y_train, CV_predicted)) # confusion matrix on CV predictions (train data)
print("\n")

print("Classification report CV predictions (train data)")
print(metrics.classification_report(y_train, CV_predicted, target_names=['Edible', 'Poisonous'])) # classification report CV predictions (train data); label order follows the encoding: 0 = edible ('e'), 1 = poisonous ('p')

Cross validation Score on train data:  1.0


Confusion matrix on CV predictions (train data)
[[3366    0]
 [   0 3133]]


Classification report CV predictions (train data)
              precision    recall  f1-score   support

      Edible       1.00      1.00      1.00      3366
   Poisonous       1.00      1.00      1.00      3133

    accuracy                           1.00      6499
   macro avg       1.00      1.00      1.00      6499
weighted avg       1.00      1.00      1.00      6499



### 8. Model Evaluation on Test data

In [52]:
gbClassifier_best_mdl= gbClassifier_best.fit(X_train, y_train) #fit the best GB classifier with training data

y_predicted = gbClassifier_best_mdl.predict(X_test) #Predict the outcomes with best GB classifier for test data

In [53]:
mdl_score = gbClassifier_best_mdl.score(X_test,y_test) #model score (test data)
print ("Model Score on test data:",mdl_score)
print("\n")

print("Confusion matrix (test data)")
print(metrics.confusion_matrix(y_test, y_predicted)) #confusion matrix (test data)
print("\n")

print("Classification report (test data)")
print(metrics.classification_report(y_test, y_predicted, target_names=['Edible', 'Poisonous'])) # classification report (test data); label order follows the encoding: 0 = edible ('e'), 1 = poisonous ('p')

Model Score on test data: 1.0


Confusion matrix (test data)
[[842   0]
 [  0 783]]


Classification report (test data)
              precision    recall  f1-score   support

      Edible       1.00      1.00      1.00       842
   Poisonous       1.00      1.00      1.00       783

    accuracy                           1.00      1625
   macro avg       1.00      1.00      1.00      1625
weighted avg       1.00      1.00      1.00      1625

