### Bootstrap Aggregation (bagging)

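Ensemble machine learning techniques combine weak learners into a strong learner. The random forest model uses bootstrap aggregation as part of its algorithm: each tree is trained on a bootstrap sample (rows drawn with replacement from the training data), and the trees' predictions are then aggregated.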

Note: decision trees are prone to overfitting, meaning the model's predictions are excessively tailored to the specific dataset it was trained on. When a model overfits, its performance degrades when it encounters a new dataset.

Summary: bootstrap aggregation can reduce overfitting, because it combines many trees that were each trained on a different sample of the data.
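
As a rough illustration of the idea, the sketch below compares a single decision tree with a bagged ensemble of trees. The synthetic dataset from `make_classification`, the sample sizes, and the seeds are assumptions chosen for the example; they are not part of the loans data used later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# A single, fully grown tree tends to memorize the training set.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Bagging: each tree is fit on a bootstrap sample (rows drawn with
# replacement), and the trees' predictions are aggregated.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=50, random_state=0).fit(X_tr, y_tr)

tree_train_acc = single_tree.score(X_tr, y_tr)
tree_test_acc = single_tree.score(X_te, y_te)
bag_test_acc = bagged_trees.score(X_te, y_te)

print(f"Single tree  train: {tree_train_acc:.3f}  test: {tree_test_acc:.3f}")
print(f"Bagged trees test:  {bag_test_acc:.3f}")
```

A large gap between the single tree's training and testing accuracy is the overfitting described above; comparing it with the bagged score shows how much aggregation helps on this particular dataset.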


### Boosting

In boosting, the weak learners are not trained independently and then combined, as they are in bagging (bootstrap aggregation). Instead, they are trained sequentially, with each model learning from the mistakes of the previous one.

### Adaptive Boosting (AdaBoost)
A model is trained, then evaluated for errors. After this, another model is trained, but this time the samples that the previous model got wrong are given extra weight. The purpose of the weighting is to reduce similar errors in subsequent models. The process repeats until the error rate is minimized.
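
The reweighting step can be sketched by hand for a single round. The labels and predictions below are made-up values, and the formulas are those of the classic discrete AdaBoost algorithm; library implementations may differ in their details.

```python
import numpy as np

# Toy labels and the predictions of a first weak model (hypothetical values).
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # one mistake, at index 1

# Every sample starts with the same weight.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error rate of the model, and its influence (alpha) in the final vote.
miss = y_pred != y_true
error = np.sum(weights[miss])
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weight of misclassified samples, lower the rest, then renormalize.
weights = weights * np.exp(np.where(miss, alpha, -alpha))
weights = weights / weights.sum()

print("error:", error)
print("alpha:", round(alpha, 3))
print("new weights:", np.round(weights, 3))
```

After the update, the single misclassified sample carries half of the total weight, so the next model is pushed to get it right.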

### Gradient boosting
In contrast to AdaBoost, gradient boosting does not try to reduce errors by reweighting the samples that were predicted incorrectly. Instead, each new tree is fit to the errors left by the trees before it.

Process of gradient boosting:

1. A small tree (stump) is added to the model, and the errors are evaluated.
2. A second stump is added to the first and attempts to minimize the errors from the first stump. These errors are called pseudo-residuals.
3. A third stump is added to the first two and attempts to minimize the pseudo-residuals from the previous two.
4. The process is repeated until the errors are minimized as much as possible or until a specified number of repetitions has been reached. This cap is loosely comparable to the number of epochs in neural network training, although each round adds a new tree rather than updating the same set of weights.
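
The steps above can be sketched by hand for a regression problem with squared-error loss, where the pseudo-residuals are simply the differences between the targets and the current predictions. The synthetic sine data, the variable names, and the settings (a learning rate of 0.5 and 20 rounds) are assumptions for illustration; scikit-learn's own implementation is more elaborate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression data (an assumption for this sketch).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
n_rounds = 20

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y_demo, y_demo.mean())
mse_per_round = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the pseudo-residuals are the plain residuals.
    residuals = y_demo - prediction
    # Fit a stump (depth-1 tree) to the residuals of the current ensemble.
    stump = DecisionTreeRegressor(max_depth=1).fit(X_demo, residuals)
    # Add a damped version of the stump's correction to the ensemble.
    prediction += learning_rate * stump.predict(X_demo)
    mse_per_round.append(np.mean((y_demo - prediction) ** 2))

print(f"MSE before boosting: {mse_per_round[0]:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_per_round[-1]:.4f}")
```

The training error shrinks with every round, which is why the number of rounds and the learning rate have to be limited to avoid overfitting.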

### Import dependencies and load data

In [1]:
import pandas as pd
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
file_path = Path("../Resources/loans_data_encoded.csv")
loans_df = pd.read_csv(file_path)
loans_df.head()

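| | amount | term | age | bad | education_Bachelor | education_High School or Below | education_Master or Above | education_college | gender_female | gender_male | month_num |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 30 | 45 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 6 |
| 1 | 1000 | 30 | 50 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 7 |
| 2 | 1000 | 30 | 33 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 8 |
| 3 | 1000 | 15 | 27 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 9 |
| 4 | 1000 | 30 | 28 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 10 |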


### Separate Feature and Target Column

In [3]:
# Define features set
X = loans_df.copy()
X = X.drop("bad", axis=1)
X[:5]

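| | amount | term | age | education_Bachelor | education_High School or Below | education_Master or Above | education_college | gender_female | gender_male | month_num |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 30 | 45 | 0 | 1 | 0 | 0 | 0 | 1 | 6 |
| 1 | 1000 | 30 | 50 | 1 | 0 | 0 | 0 | 1 | 0 | 7 |
| 2 | 1000 | 30 | 33 | 1 | 0 | 0 | 0 | 1 | 0 | 8 |
| 3 | 1000 | 15 | 27 | 0 | 0 | 0 | 1 | 0 | 1 | 9 |
| 4 | 1000 | 30 | 28 | 0 | 0 | 0 | 1 | 1 | 0 | 10 |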


In [4]:
# Define target vector
y = loans_df["bad"].values
y[:5]

array([0, 0, 0, 0, 0])

### Split Into Training and Testing Sets

In [5]:
# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X,
   y, random_state=1)
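
Because no `test_size` is passed, `train_test_split` holds out 25% of the rows for testing by default. The sketch below shows this on placeholder data; the 500-row size and the perfectly balanced labels are assumptions. It also shows the optional `stratify` argument, which keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 500 placeholder rows with two features (an assumed size, for illustration).
X_demo = np.arange(1000).reshape(500, 2)
y_demo = np.array([0, 1] * 250)

# With no test_size given, 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)

# stratify=y keeps the share of each class the same in both splits.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo)
print(y_tr_s.mean(), y_te_s.mean())
```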

### Scale Data

In [6]:
# Create scaler instance.
scaler = StandardScaler()

# Fit standard scaler
X_scaler = scaler.fit(X_train)

# Scaling data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
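
The scaler is fit on the training rows only, and the same training mean and standard deviation are then applied to the test rows, so no information from the test set leaks into preprocessing. A small sketch with made-up amount and term values (not taken from the loans file) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrix: loan amount and term (illustrative values).
X_tr = np.array([[1000.0, 30.0], [800.0, 15.0], [1000.0, 15.0], [600.0, 30.0]])
X_te = np.array([[900.0, 7.0]])

# fit() learns each column's mean and standard deviation from training rows only.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
# transform() on the test rows reuses the training mean and standard deviation.
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # approximately 0 for each column
print(X_tr_scaled.std(axis=0))   # approximately 1 for each column
print(X_te_scaled)
```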

### Choose best learning rate via for loop
Note: a for loop is used to identify which of the candidate learning rates yields the best performance.

In [15]:
from sklearn.ensemble import GradientBoostingClassifier

# Candidate learning rates; a classifier is created for each one
learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
for learning_rate in learning_rates:
    classifier = GradientBoostingClassifier(n_estimators=20,
                                            learning_rate=learning_rate,
                                            max_features=5,
                                            max_depth=3,
                                            random_state=0)

    # Fit the model
    classifier.fit(X_train_scaled, y_train)
    print("Learning rate: ", learning_rate)

    # Score the model
    print("Accuracy score (training): {0:.3f}".format(
        classifier.score(
            X_train_scaled,
            y_train)))
    print("Accuracy score (validation): {0:.3f}".format(
        classifier.score(
            X_test_scaled,
            y_test)))
    print()

Learning rate:  0.05
Accuracy score (training): 0.627
Accuracy score (validation): 0.520

Learning rate:  0.1
Accuracy score (training): 0.651
Accuracy score (validation): 0.520

Learning rate:  0.25
Accuracy score (training): 0.715
Accuracy score (validation): 0.552

Learning rate:  0.5
Accuracy score (training): 0.789
Accuracy score (validation): 0.560

Learning rate:  0.75
Accuracy score (training): 0.816
Accuracy score (validation): 0.552

Learning rate:  1
Accuracy score (training): 0.797
Accuracy score (validation): 0.528



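Note: accuracy on the testing set (labeled "validation" in the output above) matters more than training accuracy, because it estimates how the model will perform on unseen data. A model that performs well on the training set but poorly on the testing set is said to be "overfit". Overfitting occurs when a model gives undue importance to patterns within a particular dataset that are not found in other, similar datasets. Instead of learning a general pattern that can be applied to other similar datasets, it learns the patterns specific to one dataset. Here, the gap between training and validation accuracy widens as the learning rate increases (for example, 0.816 versus 0.552 at a learning rate of 0.75).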
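
The loop above picks the learning rate by scoring each model on the testing set, which means the testing set is no longer fully unseen when the final model is evaluated. A common alternative is to compare learning rates with cross-validation on the training data only. The sketch below uses synthetic data as a stand-in for `X_train_scaled` and `y_train`; the dataset and the choice of 5 folds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

learning_rates = [0.05, 0.1, 0.25, 0.5, 0.75, 1]
cv_scores = {}
for learning_rate in learning_rates:
    model = GradientBoostingClassifier(n_estimators=20,
                                       learning_rate=learning_rate,
                                       max_features=5,
                                       max_depth=3,
                                       random_state=0)
    # Mean accuracy over 5 folds of the training data.
    cv_scores[learning_rate] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Keep the learning rate with the highest mean cross-validated accuracy.
best_rate = max(cv_scores, key=cv_scores.get)
for rate, score in cv_scores.items():
    print(f"Learning rate {rate}: mean CV accuracy {score:.3f}")
print("Best learning rate by cross-validation:", best_rate)
```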

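### Create gradient boosting classifier

1) In the loop above, a GradientBoostingClassifier model is instantiated for each learning rate value. Here, a single model is created with a learning rate of 0.5, which gave the highest validation accuracy (0.560).

2) The max_depth argument sets the maximum depth of each decision tree used in gradient boosting.

3) The n_estimators argument sets the number of trees (boosting rounds) used.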
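
To see how these arguments show up in a fitted model, the sketch below trains a classifier with the same settings on synthetic data (an assumption; the loans data is not used here) and inspects the trees stored in its `estimators_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

clf = GradientBoostingClassifier(n_estimators=20,
                                 learning_rate=0.5,
                                 max_features=5,
                                 max_depth=3,
                                 random_state=0)
clf.fit(X_demo, y_demo)

# For binary classification, one regression tree is stored per boosting round.
print("Shape of estimators_:", clf.estimators_.shape)

# No individual tree grows deeper than max_depth.
depths = [tree.get_depth() for tree in clf.estimators_.ravel()]
print("Deepest tree:", max(depths))
```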

In [16]:
# Choose a learning rate and create classifier
classifier = GradientBoostingClassifier(n_estimators=20,
                                        learning_rate=0.5,
                                        max_features=5,
                                        max_depth=3,
                                        random_state=0)

# Fit the model
classifier.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set
predictions = classifier.predict(X_test_scaled)
pd.DataFrame({"Prediction": predictions, "Actual": y_test}).head(20)

| | Prediction | Actual |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 1 |
| 5 | 1 | 1 |
| 6 | 0 | 1 |
| 7 | 0 | 0 |
| 8 | 0 | 0 |
| 9 | 0 | 0 |


### Evaluate the Model

In [17]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Evaluate accuracy score
acc_score = accuracy_score(y_test, predictions)
print(f"Accuracy Score : {acc_score}")

Accuracy Score : 0.56


In [18]:
# Generate confusion matrix of the results.
cm = confusion_matrix(y_test, predictions)
cm_df = pd.DataFrame(
   cm, index=["Actual 0", "Actual 1"],
   columns=["Predicted 0", "Predicted 1"]
)
display(cm_df)

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 48 | 17 |
| Actual 1 | 38 | 22 |


In [19]:
# Generate classification report
print("Classification Report")
print(classification_report(y_test, predictions))

Classification Report
              precision    recall  f1-score   support

           0       0.56      0.74      0.64        65
           1       0.56      0.37      0.44        60

    accuracy                           0.56       125
   macro avg       0.56      0.55      0.54       125
weighted avg       0.56      0.56      0.54       125
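
The figures in the report can be reproduced directly from the confusion matrix. The sketch below recomputes the accuracy and the class 1 metrics from the four counts shown earlier, plus the accuracy of always predicting class 0 as a baseline; the variable names are chosen for the example.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted.
cm = np.array([[48, 17],
               [38, 22]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # 70 / 125
precision_1 = tp / (tp + fp)         # 22 / 39
recall_1 = tp / (tp + fn)            # 22 / 60
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy of always predicting the majority class (0).
baseline = cm[0].sum() / cm.sum()    # 65 / 125

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f} baseline={baseline:.2f}")
# prints: accuracy=0.56 precision=0.56 recall=0.37 f1=0.44 baseline=0.52
```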



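Based on its accuracy (0.56, or 56%), the gradient boosting model is still not a good model for evaluating loan applications. It barely outperforms always predicting class 0, which would be correct for 65 of the 125 test rows (52%), and its recall for class 1 is only 0.37, meaning it identifies just 22 of the 60 bad loans.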