# Ensemble Learning 

In this notebook, I implement ensemble learning on the ***Shill Bidding Dataset***, including Bagging, Random Forests, AdaBoost, and Gradient Boosting.

I implement the algorithms following the notes in ***Lecture 9.1~9.2 Ensemble Methods***.

## Algorithm Implementation
---

In [4]:
# Import the libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,mean_squared_error,r2_score
from sklearn.metrics import classification_report

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier,AdaBoostClassifier,GradientBoostingRegressor

In [6]:
# Import the data
df = pd.read_csv("../3. Data/Shill_Bidding_Dataset.csv")
df.head()

| | Record_ID | Auction_ID | Bidder_ID | Bidder_Tendency | Bidding_Ratio | Successive_Outbidding | Last_Bidding | Auction_Bids | Starting_Price_Average | Early_Bidding | Winning_Ratio | Auction_Duration | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 732 | `_***i` | 0.2 | 0.4 | 0.0 | 2.8e-05 | 0.0 | 0.993593 | 2.8e-05 | 0.666667 | 5 | 0 |
| 1 | 2 | 732 | `g***r` | 0.02439 | 0.2 | 0.0 | 0.013123 | 0.0 | 0.993593 | 0.013123 | 0.944444 | 5 | 0 |
| 2 | 3 | 732 | `t***p` | 0.142857 | 0.2 | 0.0 | 0.003042 | 0.0 | 0.993593 | 0.003042 | 1.0 | 5 | 0 |
| 3 | 4 | 732 | `7***n` | 0.1 | 0.2 | 0.0 | 0.097477 | 0.0 | 0.993593 | 0.097477 | 1.0 | 5 | 0 |
| 4 | 5 | 900 | `z***z` | 0.051282 | 0.222222 | 0.0 | 0.001318 | 0.0 | 0.0 | 0.001242 | 0.5 | 7 | 0 |


We drop the identifier columns, separate the ```Class``` label, and hold out 30% of the rows as a testing set.

In [7]:
X = df.drop(['Record_ID','Auction_ID','Bidder_ID','Class'],axis=1)
y = df.Class

# Create a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)

### Bagging

From Scikit-Learn, we import ```DecisionTreeClassifier``` and ```BaggingClassifier```. We then instantiate a ```BaggingClassifier``` that trains an ensemble of 500 ```DecisionTreeClassifier``` estimators, each with ```max_depth = 3```, on bootstrap samples of the training set.

In [8]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=3, random_state=23),
                            n_estimators = 500,
                            bootstrap = True,
                            n_jobs = -1)
bag_clf.fit(X_train, y_train)
bag_y_pred = bag_clf.predict(X_test)
print("Bagging Classification Report")
print(classification_report(y_test, bag_y_pred), "\n")

Bagging Classification Report
             precision    recall  f1-score   support

          0       1.00      0.98      0.99      1699
          1       0.86      0.98      0.92       198

avg / total       0.98      0.98      0.98      1897
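
To make the bagging idea concrete, the following is a minimal sketch of bootstrap sampling plus majority voting done by hand. It runs on synthetic data from `make_classification` rather than the Shill Bidding Dataset, and the names `X_demo`, `Xd_train`, `trees`, and `manual_bag_pred` are illustrative assumptions. It uses hard voting for simplicity, whereas `BaggingClassifier` averages predicted class probabilities when the base estimator provides them, so the two will not match exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Shill Bidding Dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=9, random_state=23)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=23)

rng = np.random.default_rng(23)
n_trees = 25  # odd and small: avoids vote ties and keeps the sketch fast
trees = []
for _ in range(n_trees):
    # Bootstrap: draw as many row indices as training rows, with replacement
    idx = rng.integers(0, len(Xd_train), size=len(Xd_train))
    tree = DecisionTreeClassifier(max_depth=3, random_state=23)
    trees.append(tree.fit(Xd_train[idx], yd_train[idx]))

# Aggregate: hard majority vote over the 0/1 predictions of all trees
votes = np.stack([t.predict(Xd_test) for t in trees])  # shape (n_trees, n_test)
manual_bag_pred = (votes.mean(axis=0) > 0.5).astype(int)
manual_bag_acc = (manual_bag_pred == yd_test).mean()
print(f"Manual bagging accuracy: {manual_bag_acc:.3f}")
```

Each tree sees a different resampled view of the training set, which is what lets the vote reduce variance relative to a single depth-3 tree.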
 



### Random Forests

In [9]:
forest_clf = RandomForestClassifier(max_depth = 3, 
                                    n_estimators = 500,
                                    bootstrap = True,
                                    n_jobs = -1)
forest_clf.fit(X_train, y_train)
forest_y_pred = forest_clf.predict(X_test)
print("Forest Classification Report")
print(classification_report(y_test, forest_y_pred), "\n")

Forest Classification Report
             precision    recall  f1-score   support

          0       1.00      0.98      0.99      1699
          1       0.85      0.98      0.91       198

avg / total       0.98      0.98      0.98      1897
 



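The random forest reaches the same overall accuracy as bagging (about 0.98). Its precision on the positive class, however, is slightly lower (0.85 versus 0.86), while recall is 0.98 for both. Neither ensemble object is given a ```random_state```, so a gap this small can change between runs.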

### Random Forest Feature Importance
Another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it.
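
The weighted impurity decrease described above can be reproduced by hand for a single tree. The sketch below is an illustration on synthetic data, not the notebook's forest: it walks the fitted `tree_` arrays, sums the sample-weighted impurity decrease of every split per feature, normalizes the result, and compares it with `feature_importances_`. The names `X_demo`, `single_tree`, and `manual_importances` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Shill Bidding Dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, random_state=0)
single_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
t = single_tree.tree_

manual_importances = np.zeros(X_demo.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, so no impurity reduction
        continue
    # Impurity decrease of this split, weighted by the samples reaching each node
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual_importances[t.feature[node]] += decrease

# Normalize so the importances sum to 1
manual_importances /= manual_importances.sum()
print(np.allclose(manual_importances, single_tree.feature_importances_))
```

A forest's `feature_importances_` then averages these per-tree values across all of its trees.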

In [11]:
names = X_train.columns.tolist()
for name, score in zip(names, forest_clf.feature_importances_):
    print(name, score)

Bidder_Tendency 0.03518456451080783
Bidding_Ratio 0.2139412826006667
Successive_Outbidding 0.598802971783876
Last_Bidding 0.006190933950224021
Auction_Bids 0.036076670697468054
Starting_Price_Average 0.008009247482470842
Early_Bidding 0.006784865384835935
Winning_Ratio 0.09126221818740424
Auction_Duration 0.00374724540224616


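The three most important features are ```Successive_Outbidding``` (about 0.60), ```Bidding_Ratio``` (about 0.21), and ```Winning_Ratio``` (about 0.09). Together they account for roughly 90% of the total importance, and every other attribute scores below 0.04.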

### AdaBoost

In [12]:
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3, random_state=23), 
                             n_estimators = 10,
                             algorithm = "SAMME.R",
                             learning_rate = 0.5)
ada_clf.fit(X_train, y_train)
ada_y_pred = ada_clf.predict(X_test)
print("AdaBoost Classification Report")
print(classification_report(y_test, ada_y_pred), "\n")

AdaBoost Classification Report
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      1699
          1       0.99      0.97      0.98       198

avg / total       1.00      1.00      1.00      1897
 



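AdaBoost performs better than the random forest on this test set: for the positive class, precision rises from 0.85 to 0.99 and the F1-score from 0.91 to 0.98, while recall drops slightly from 0.98 to 0.97.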

### Gradient Boosting

In [20]:
gb_reg = GradientBoostingRegressor(max_depth = 3, n_estimators = 150, random_state=23)
gb_reg.fit(X_train, y_train)
gb_y_pred = gb_reg.predict(X_test)

print(f"MSE for gb_reg is: {mean_squared_error(y_test, gb_y_pred)}")
print(f"The R^2 for gb_reg is: {round(r2_score(y_test, gb_y_pred), 3)}")

MSE for gb_reg is: 0.005057274697830985
The R^2 for gb_reg is: 0.946
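
Because the regressor returns real-valued scores rather than class labels, its MSE and R^2 are not directly comparable with the classification reports above. One possible way to compare them, sketched here on synthetic data, is to threshold the scores. The 0.5 cutoff and the names `gb_demo`, `gb_scores`, and `gb_labels` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Shill Bidding Dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=9, random_state=23)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=23)

# Same hyperparameters as the notebook's regressor
gb_demo = GradientBoostingRegressor(max_depth=3, n_estimators=150, random_state=23)
gb_demo.fit(Xd_train, yd_train)

gb_scores = gb_demo.predict(Xd_test)        # real-valued, not confined to [0, 1]
gb_labels = (gb_scores >= 0.5).astype(int)  # 0.5 cutoff is an assumed threshold
gb_acc = (gb_labels == yd_test).mean()

print("Thresholded Gradient Boosting Classification Report")
print(classification_report(yd_test, gb_labels))
```

Scikit-Learn also provides `GradientBoostingClassifier`, which predicts labels directly and would avoid the manual threshold.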
