# Chapter-7: Random forests and Boosting

## Ensemble models:
Building several models instead of a single model is called ensemble model building. Ensemble models are analogous to a concept from psychology called the wisdom of crowds: if we are unsure and cannot decide on the right option, we can follow the option preferred by most of the crowd,
provided that everyone in the crowd is smart and everyone holds an independent opinion.

In [None]:
import pandas as pd
import sklearn as sk
import numpy as np
import scipy as sp
from sklearn  import model_selection
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc, f1_score,confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
github_link="https://raw.githubusercontent.com/venkatareddykonasani/ML_DL_py_TF/master/Chapter7_RF_Boosting/Datasets/"

## Bagging:
Bagging stands for Bootstrap Aggregating. Bagging has two steps: bootstrap sampling and aggregating the classification results.

**Bootstrap sampling**

While drawing a bootstrap sample of 10 records, we do not sample all the records in one shot. We draw one record at a time. For example, suppose there are ten records; we draw one record randomly, and let that be record 7. We now draw one more record, between 1 and 10, and let that be record 4. We now have two sampled records. Let us draw the third record, and let that be record 7 again. We repeat this process ten times to create a sample of 10 records. The important point to note here is that while picking the second record (and every later record), we consider all ten records of the population; this may lead to duplication of some of the data points. In a bootstrap sample, some records are repeated multiple times, and some records are never picked. The above example creates bootstrap sample set-1. We can repeat the same process to create bootstrap sample set-2 and more.

The bootstrap sample is also known as a sample with replacement.
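
The drawing procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original notebook: the record IDs 1 to 10 follow the example in the text, while the seed value and the variable names are assumptions chosen for this sketch.

```python
import numpy as np

# Ten record IDs, numbered 1 to 10 as in the example above.
records = np.arange(1, 11)

# Seeded generator so the draw is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(7)

# Sampling with replacement: every draw considers all ten records.
bootstrap_set_1 = rng.choice(records, size=len(records), replace=True)
bootstrap_set_2 = rng.choice(records, size=len(records), replace=True)

# Records that were never picked in set-1.
never_picked = np.setdiff1d(records, bootstrap_set_1)

print("Bootstrap sample set-1:", bootstrap_set_1)
print("Bootstrap sample set-2:", bootstrap_set_2)
print("Never picked in set-1 :", never_picked)
```

Running the cell a few times with different seeds shows that each bootstrap set contains repeated records and leaves some records out.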

#### Bagging Algorithm:
1. Draw K bootstrap samples; a higher value of K is preferred.
2. On each of the bootstrap samples, build a model.
3. Collate the results for the new data points: use the average for regression models and the maximum votes for classification models.
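
The three steps above can be sketched by hand as follows. This is an illustrative sketch under stated assumptions: the data is synthetic (generated with `make_classification`), decision trees are used as the base model, and `K = 25`, `max_depth=5`, and all variable names are arbitrary choices for this example rather than values from the chapter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the chapter's dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

K = 25  # number of bootstrap samples (odd, so a 0/1 vote cannot tie)
rng = np.random.default_rng(0)
models = []

for _ in range(K):
    # Step 1: draw a bootstrap sample of row indices (with replacement).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: build a model on that sample.
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X_tr[idx], y_tr[idx])
    models.append(model)

# Step 3: collate by majority vote (labels are 0/1, so a mean above 0.5 is a majority).
votes = np.array([m.predict(X_te) for m in models])  # shape (K, n_test)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Bagged accuracy on test data =", round((bagged_pred == y_te).mean(), 2))
```

For a regression problem, the last step would take the mean of the K predictions instead of a vote.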

## Random Forest
In the bagging algorithm, if every model we build is a decision tree, and each split considers only a random subset of the variables,
then the ensemble is called a Random Forest.

**Algorithm**:

1. Draw K bootstrap samples; a higher value of K is preferred.
2. On each of the bootstrap samples, build a decision tree model. While building this model, do
not consider all the variables for splitting.

    a. Consider only p randomly selected variables while splitting every node. If there are t variables in the data, then p << t.

    b. Use the variable with the highest information gain out of these p variables to split the node into child nodes.

    c. Go to each child node and repeat the above two steps of selecting p variables randomly and using the best variable to split the node.

    d. Grow the tree as large as possible, without pruning.
3. Collate the results for the new data points based on the class with the maximum votes.
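
Steps 2a and 2b, applied to a single node, can be sketched as below. This is a simplified illustration and not how scikit-learn implements its trees: the data is random toy data, the helper functions `entropy` and `information_gain` are hypothetical names written for this sketch, and each candidate variable is tested at one threshold only (its median), whereas a real tree searches over many thresholds.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 0/1 label array."""
    if len(labels) == 0:
        return 0.0
    p1 = labels.mean()
    probs = np.array([p1, 1 - p1])
    probs = probs[probs > 0]
    return float(-(probs * np.log2(probs)).sum())

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left, right = labels[feature <= threshold], labels[feature > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

rng = np.random.default_rng(0)
t, p = 22, 4                               # t variables in the data, p tried at this node
X_node = rng.normal(size=(500, t))
y_node = (X_node[:, 3] > 0).astype(int)    # toy target driven by column 3

# Step 2a: pick p candidate variables at random for this node.
candidates = rng.choice(t, size=p, replace=False)

# Step 2b: among the candidates, keep the variable with the highest gain.
gains = {int(j): information_gain(X_node[:, j], y_node, np.median(X_node[:, j]))
         for j in candidates}
best = max(gains, key=gains.get)
print("Candidates:", sorted(gains), "-> split on variable", best)
```

Because only p of the t variables are examined at each node, the strongest variable is often not available, which is what makes the individual trees differ from one another.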

**Case Study**

The data set has data from 22 sensors and one target variable. Below are some of the basic details of
the data.

In [None]:
#car_train=pd.read_csv(r"/content/drive/My Drive/DataSets/Chapter-7/datasets/car_accidents/car_sensors.csv")
# github_link already ends with "/", so the relative path must not start with one
car_train=pd.read_csv(github_link + "car_accidents/car_sensors.csv")

In [None]:
print(car_train.shape)

In [None]:
print(car_train.columns)

In [None]:
print(car_train.info())

From the above output, we can see that there are 33,239 records in the dataset. There are 23
variables in the data. The target variable is named “safe”; the other variables represent the data
collected from 22 sensors. All the columns are numerical. We will perform basic data exploration on
the predictor and target variables.

In [None]:
all_cols_summary=car_train.describe()
print(round(all_cols_summary,2))

In [None]:
print(car_train['safe'].value_counts())

The target variable takes two values: 1 (safe) and 0 (not safe). If we knew the complete details about these
sensors, we could perform some feature engineering tasks. We will go ahead with the model
building for now.

In [None]:
features=car_train.columns.values[1:]
print(features)
X = car_train[features]
y = car_train['safe']
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y ,test_size=0.2, random_state=55)

In [None]:
print("X_train Shape ",X_train.shape)
print("y_train Shape ", y_train.shape)
print("X_test Shape ",X_test.shape)
print("y_test Shape ", y_test.shape)

Now we will build a decision tree model on the training data.

In [None]:
D_tree = tree.DecisionTreeClassifier(max_depth=7)
D_tree.fit(X_train,y_train)

In [None]:
tree_predict1=D_tree.predict(X_train)
cm1 = confusion_matrix(y_train,tree_predict1)
accuracy_train=(cm1[0,0]+cm1[1,1])/sum(sum(cm1))
print("Decision Tree Accuracy on Train data = ", round(accuracy_train,2) )

In [None]:
tree_predict2=D_tree.predict(X_test)
cm2 = confusion_matrix(y_test,tree_predict2)
accuracy_test=(cm2[0,0]+cm2[1,1])/sum(sum(cm2))
print("Decision Tree Accuracy on Test data = ", round(accuracy_test,2) )

In [None]:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, tree_predict1)
auc_train = auc(false_positive_rate, true_positive_rate)
print("Decision Tree AUC on Train data = ", round(auc_train,2) )

In [None]:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, tree_predict2)
auc_test = auc(false_positive_rate, true_positive_rate)
print("Decision Tree AUC on Test data = ", round(auc_test,2) )
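
The cells above pass hard 0/1 predictions from `predict` to `roc_curve`, so the resulting ROC curve has only one interior point. A common alternative, sketched here on synthetic data rather than the car sensor data, is to score with the predicted probability of class 1 from `predict_proba`. The dataset, the variable names, and the use of `roc_auc_score` are assumptions made for this illustration; the numbers reported in this chapter come from the label-based calculation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the car sensor data).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

demo_tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)

# AUC from hard class labels: the ROC curve has a single interior point.
auc_labels = roc_auc_score(y_te, demo_tree.predict(X_te))

# AUC from predicted probabilities of class 1: the curve uses every threshold.
auc_probs = roc_auc_score(y_te, demo_tree.predict_proba(X_te)[:, 1])

print("AUC from labels        =", round(auc_labels, 2))
print("AUC from probabilities =", round(auc_probs, 2))
```

The two values usually differ, so it is worth stating which of the two calculations a reported AUC refers to.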

After many trials, we arrive at the optimal max_depth for this data, which is max_depth=7.
The best decision tree gives us 88% accuracy and an AUC of 87%. If we try a higher max_depth, we
get into the overfitting zone. We will now build a random forest model. The two important
hyperparameters are the number of trees and the number of features. We can set the max_depth to
a fixed number, say 10 in this case.

In [None]:
R_forest=RandomForestClassifier(n_estimators=300, max_features=4, max_depth=10)
R_forest.fit(X_train,y_train)

**Code Description**

* n_estimators=300. We are building 300 trees here. A higher number is preferred. If the dataset is large, then it can be a smaller number. We can try values from 100 to 500 and choose the optimal one.
* max_features=4. The number of features randomly chosen at each split. A lower number is preferred. We can try 3, 4, or 5.
* max_depth=10. We can fix this at a slightly higher value than for a standard decision tree. If we try a lower value for this, we will get lower accuracy on both train and test data.

In [None]:
forest_predict1=R_forest.predict(X_train)
cm1 = confusion_matrix(y_train,forest_predict1)
accuracy_train=(cm1[0,0]+cm1[1,1])/sum(sum(cm1))
print("Random Forest Accuracy on Train data = ", round(accuracy_train,2) )

In [None]:
forest_predict2=R_forest.predict(X_test)
cm2 = confusion_matrix(y_test,forest_predict2)
accuracy_test=(cm2[0,0]+cm2[1,1])/sum(sum(cm2))
print("Random Forest Accuracy on Test data = ", round(accuracy_test,2) )

In [None]:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train, forest_predict1)
auc_train = auc(false_positive_rate, true_positive_rate)
print("Random Forest AUC on Train data =  ", round(auc_train,2) )

In [None]:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, forest_predict2)
auc_test= auc(false_positive_rate, true_positive_rate)
print("Random Forest AUC on Test data =  ", round(auc_test,2) )

The Random Forest gives us 91% accuracy and an AUC of 90% on the test data. Compared to the
single decision tree, this is an improvement of about 3 percentage points. In general, a random forest
improves on a single decision tree by 1% to 5%.

## Boosting
In Bagging, we built multiple models in parallel. In Boosting, we also build several weak models and
combine them to form a robust model. However, in Boosting we build them sequentially; that is
the main difference between Bagging and Boosting. If we think of bagging as the wisdom of crowds,
then boosting is the wisdom of crowds with some weight given to individuals based on their skill.

### Ada Boosting Algorithm

**Step-1:** Data and Weak Classifier

**Step-2:** Error calculation and Weighted Sample

**Step-3:** Rebuild and Repeat

**Step-4:** Stopping Criteria
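
The four steps listed above can be sketched with the standard discrete AdaBoost update. Several things here are assumptions made for illustration and are not taken from the chapter: the data is synthetic, the weak classifier is a depth-1 decision tree, the weights are passed to the tree through `sample_weight` instead of drawing a weighted sample, and the stopping rule is a fixed number of rounds or a weak classifier that is no better than chance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; labels recoded to -1/+1 for the weight update.
X_demo, y01 = make_classification(n_samples=500, n_features=6, random_state=2)
y_pm = np.where(y01 == 1, 1, -1)

n_rounds = 30
weights = np.full(len(y_pm), 1 / len(y_pm))   # Step-1: equal weights to start
stumps, alphas = [], []

for _ in range(n_rounds):
    # Step-1: fit a weak classifier (a depth-1 tree) on the weighted data.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_pm, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Step-2: weighted error, then the weight (alpha) given to this classifier.
    err = np.clip(weights[pred != y_pm].sum(), 1e-10, 1 - 1e-10)
    if err >= 0.5:        # Step-4: stop when the learner is no better than chance
        break
    alpha = 0.5 * np.log((1 - err) / err)

    # Step-2/3: raise the weights of misclassified records, renormalise, repeat.
    weights *= np.exp(-alpha * y_pm * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all weak classifiers.
score = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
ada_pred = np.where(score >= 0, 1, -1)
print("Rounds used:", len(stumps), " Train accuracy:", round((ada_pred == y_pm).mean(), 2))
```

The alpha values play the role of the "weight given to individuals based on their skill": a weak classifier with a lower weighted error gets a larger say in the final vote.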

### Gradient Boosting
The Gradient Boosting algorithm is easiest to understand with regression. We take the whole training
data and build the first regression model. This model may not be perfect. We take the
predictions from this model and calculate the errors. We then build a new model that
learns only these errors. We recalculate the errors and repeat this process.

**Gradient Boosting Algorithm**

**Step-1:** Initial Model

**Step-2:** Residuals Calculation

**Step-3:** Build Model on Residuals

**Step-4:** Update the Residuals and Update the Model

**Step-5:** Stopping Criterion 
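
The five steps can be sketched for a regression problem as follows. This is a minimal illustration under stated assumptions: the toy data, the choice of the target mean as the initial model, the depth-2 regression trees, the shrinkage value of 0.1, and the fixed number of rounds as the stopping criterion are all choices made for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (an assumption for illustration).
rng = np.random.default_rng(3)
X_demo = np.sort(rng.uniform(0, 10, size=200)).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=200)

learning_rate = 0.1   # shrinkage applied to each residual model
n_iterations = 100    # Step-5: stop after a fixed number of rounds

# Step-1: initial model, here simply the mean of the target.
prediction = np.full(len(y_demo), y_demo.mean())
sse_history = [float(((y_demo - prediction) ** 2).sum())]

for _ in range(n_iterations):
    # Step-2: residuals of the current model.
    residuals = y_demo - prediction
    # Step-3: a small tree that learns only the residuals.
    residual_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    residual_tree.fit(X_demo, residuals)
    # Step-4: update the model with a shrunken copy of the new tree's output.
    prediction = prediction + learning_rate * residual_tree.predict(X_demo)
    sse_history.append(float(((y_demo - prediction) ** 2).sum()))

print("SSE after initial model :", round(sse_history[0], 2))
print("SSE after", n_iterations, "rounds   :", round(sse_history[-1], 2))
```

Plotting `sse_history` shows the error on the training data falling with every round, which is why the number of rounds has to be limited to avoid overfitting.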

#### Hyperparameters in boosting
* **Number of iterations (n)**: Boosting is an iterative algorithm. The error reduces in each iteration. The number of iterations is the first hyperparameter in boosting. Too large a number will lead to overfitting, and too small a number will lead to underfitting.

* **Shrinkage or learning rate**: Instead of taking the predictions from the residual models as they are, we shrink them by a factor called the learning rate. A smaller learning rate slows down the whole learning process: the error reduction in each iteration is smaller.

* **Size of the tree**: While solving practical problems, it is always preferable to make the trees
learn slowly. The real power of the ensemble is in building weak models and collating them.
To make the individual trees weak learners, we need to set the max_depth of each tree to
a small value.

We will use a dataset to understand boosting and the learning rate parameter in depth. The pet
adoption data has two columns: the age of the customer and the target column. The target
column has two classes: 0 (did not adopt a pet) and 1 (adopted a pet). We will use
a GBM model to predict a customer’s likelihood of adopting a pet based on their age.

In [None]:
import pandas as pd
#pets_data = pd.read_csv(r"/content/drive/My Drive/DataSets/Chapter-7/datasets/Pet_adoption/adoption.csv")
# github_link already ends with "/", so the relative path must not start with one
pets_data = pd.read_csv(github_link+"Pet_adoption/adoption.csv")
print(pets_data.columns.values)
pets_data.head(10)

In [None]:
X=pets_data[["cust_age"]]
y=pets_data['adopted_pet']

In [None]:
for i in range(1, 21):

    # Model and predictions
    boost_model = GradientBoostingClassifier(n_estimators=i, learning_rate=1, max_depth=1)
    boost_model.fit(X, y)
    pets_data["iteration_result"] = boost_model.predict_proba(X)[:, 1]
    boost_predict = boost_model.predict(X)

    # Graph: actual values as 'x', predicted probabilities as 'o'
    fig, ax1 = plt.subplots(figsize=(7, 5))
    ax1.set_title(f"Iteration : {i}", fontsize=20)
    ax1.scatter(pets_data["cust_age"], pets_data["adopted_pet"], s=50, c='b', marker="x")
    ax1.scatter(pets_data["cust_age"], pets_data["iteration_result"], s=50, c='r', marker="o")
    ax1.set_xlabel('cust_age')
    ax1.set_ylabel('adopted_pet')

    # SSE and Accuracy (micro-averaged F1 equals accuracy for single-label data)
    print("SSE : ", sum((pets_data["iteration_result"] - y)**2))
    accuracy = f1_score(y, boost_predict, average='micro')
    print("Accuracy : ", accuracy)

In the above output, the actual values are shown as ‘x’ and the predicted values as ‘o’.
In the first iteration, the error is large. Gradually, by the end of the 20th iteration, the predictions
have moved almost on top of the actual values. If we try a lower learning rate, then by the end of the
20th iteration we will still have a large error left.

In [None]:
for i in range(1, 102):

    # Model and predictions
    boost_model = GradientBoostingClassifier(n_estimators=i, learning_rate=0.1, max_depth=1)
    boost_model.fit(X, y)
    pets_data["iteration_result"] = boost_model.predict_proba(X)[:, 1]
    boost_predict = boost_model.predict(X)

    # Graph: plot only every tenth iteration (1, 11, ..., 101)
    if i % 10 == 1:
        fig, ax1 = plt.subplots(figsize=(7, 5))
        ax1.set_title(f"learning_rate=0.1, Iteration : {i}", fontsize=20)
        ax1.scatter(pets_data["cust_age"], pets_data["adopted_pet"], s=50, c='b', marker="x")
        ax1.scatter(pets_data["cust_age"], pets_data["iteration_result"], s=50, c='r', marker="o")
        ax1.set_xlabel('cust_age')
        ax1.set_ylabel('adopted_pet')

    # SSE and Accuracy (micro-averaged F1 equals accuracy for single-label data)
    print("SSE : ", sum((pets_data["iteration_result"] - y)**2))
    accuracy = f1_score(y, boost_predict, average='micro')
    print("Accuracy : ", accuracy)

From the above output, we can see the impact of the learning rate parameter. The error reduction
that we could achieve in ten steps with a learning rate of 1 took almost 100 iterations with a learning
rate of 0.1. In general, a slow-learning model with a lower learning rate is preferred.

**Case Study- Income Prediction from Census Data**



In [None]:
#income = pd.read_csv(r"/content/drive/My Drive/DataSets/Chapter-7/datasets/Adult_Census_Income/Adult_Income.csv")
# github_link already ends with "/", so the relative path must not start with one
income = pd.read_csv(github_link+"Adult_Census_Income/Adult_Income.csv")

In [None]:
print(income.shape)

In [None]:
print(income.columns)

In [None]:
print(income.info())

In [None]:
all_cols_summary=income.describe()
print(round(all_cols_summary,2))

In [None]:
categorical_vars=income.select_dtypes(include=['object']).columns
print(categorical_vars)

In [None]:
for col in categorical_vars:
    print("\n\nFrequency Table for the column ", col )
    print(income[col].value_counts())

In [None]:
income["workclass"] = income["workclass"].replace(['?','Never-worked','Without-pay'], 'Other')  
print(income["workclass"].value_counts())

In [None]:
income["marital.status"] = income["marital.status"].replace(['Never-married','Divorced','Separated','Widowed'], 'Not-married')
print(income["marital.status"].value_counts())

In [None]:
income["occupation"] = income["occupation"].replace(['?'], 'Other-service')  
print(income["occupation"].value_counts())

In [None]:
freq_country=income["native.country"].value_counts()
less_frequent= freq_country[freq_country <100].index
print(less_frequent)

In [None]:
# Pass a flat list of country names; wrapping the Index in [] would nest it
income["native.country"]=income["native.country"].replace(list(less_frequent), 'Other')
income["native.country"] = income["native.country"].replace(['?'], 'Other')  
print(income["native.country"].value_counts())

In [None]:
print(income["sex"].value_counts())
income['sex']=income['sex'].map({'Male': 0, 'Female': 1})

In [None]:
print(income["income"].value_counts())
income['income']=income['income'].map({'<=50K': 0, '>50K': 1})

In [None]:
one_hot_cols=['workclass','marital.status','occupation','native.country']
one_hot_data = pd.get_dummies(income[one_hot_cols])
print(one_hot_data.shape)
print(one_hot_data.columns.values)

In [None]:
print(income.shape)
income_final = pd.concat([income, one_hot_data], axis=1)
print(income_final.shape)
print(income_final.info())

In [None]:
one_hot_features=list(one_hot_data.columns.values)
numerical_features=['age',  'education.num', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week']
all_features=one_hot_features+numerical_features
print(all_features)

In [None]:
X=income_final[all_features]
y=income_final['income']
# random_state fixed so the split (and the accuracies below) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
gbm_model1 = GradientBoostingClassifier(learning_rate=0.01, max_depth=4,  n_estimators=100, verbose=1)
gbm_model1.fit(X_train, y_train)

In [None]:
predictions=gbm_model1.predict(X_train)
actuals=y_train
cm = confusion_matrix(actuals,predictions)
print("Confusion Matrix on Train data\n", cm)
accuracy=(cm[0,0]+cm[1,1])/(sum(sum(cm)))
print("Train Accuracy", accuracy)

In [None]:
predictions=gbm_model1.predict(X_test)
actuals=y_test
cm = confusion_matrix(actuals,predictions)
print("Confusion Matrix on Test data\n", cm)
accuracy=(cm[0,0]+cm[1,1])/(sum(sum(cm)))
print("Test Accuracy", accuracy)

In [None]:
for i in range(5,1000, 50):
    gbm_model1 = GradientBoostingClassifier(learning_rate=0.01, max_depth=4,  n_estimators=i)
    gbm_model1.fit(X_train, y_train)
    
    print("N_estimators=" , i)
    #Train data
    predictions=gbm_model1.predict(X_train)
    actuals=y_train
    cm = confusion_matrix(actuals,predictions)
    accuracy=(cm[0,0]+cm[1,1])/(sum(sum(cm)))
    print("Train Accuracy", accuracy)

    #Test data
    predictions=gbm_model1.predict(X_test)
    actuals=y_test
    cm = confusion_matrix(actuals,predictions)
    accuracy=(cm[0,0]+cm[1,1])/(sum(sum(cm)))
    print("Test Accuracy", accuracy)
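
The loop above refits the model from scratch for every value of n_estimators. A hedged alternative is to fit once with the largest value and read off the intermediate predictions with `staged_predict`, which `GradientBoostingClassifier` provides. The sketch below uses synthetic data rather than the census data, and the hyperparameter values and variable names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the census data).
X_demo, y_demo = make_classification(n_samples=1500, n_features=12, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# Fit once with the largest number of trees we want to inspect.
gbm_demo = GradientBoostingClassifier(learning_rate=0.1, max_depth=2,
                                      n_estimators=200, random_state=4)
gbm_demo.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
train_acc = [accuracy_score(y_tr, p) for p in gbm_demo.staged_predict(X_tr)]
test_acc = [accuracy_score(y_te, p) for p in gbm_demo.staged_predict(X_te)]

for n in range(25, 201, 25):
    print("N_estimators =", n,
          " Train Accuracy", round(train_acc[n - 1], 3),
          " Test Accuracy", round(test_acc[n - 1], 3))

# Number of trees with the highest test accuracy (first one in case of ties).
best_n = max(range(len(test_acc)), key=test_acc.__getitem__) + 1
print("Best test accuracy at N_estimators =", best_n)
```

Comparing the train and test accuracy across stages shows where adding more trees stops helping on unseen data.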