## Bagging Classifier

In [70]:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

In [71]:
X,y = make_classification(n_samples=10000, n_features=10,n_informative=3)

In [72]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [73]:
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)

print("Decision Tree accuracy",accuracy_score(y_test,y_pred))

Decision Tree accuracy 0.917


- Here we create a toy dataset with 10,000 rows and 10 features, of which 3 are informative.
- We then split it so that 8,000 rows are used for training and the remaining 2,000 for testing.
- A single decision tree is fitted first to see what accuracy one model reaches on this data.
- This baseline lets us check whether performance improves when we use bagging.
- Note that `make_classification` is called without `random_state`, so the exact accuracies will differ from run to run.


## Bagging

In [74]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    random_state=42
)

- So, we have kept the base estimator as a decision tree classifier.
- The number of estimators is 500, i.e., 500 decision trees.
- `max_samples=0.25` means each base model is trained on 25% of the 8,000 training rows, i.e., 2,000 rows.
- Each of the 500 decision trees, our base models, gets its own random draw of 2,000 rows.
- Bootstrap is set to true, meaning it's sampling with replacement.
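
The arithmetic behind `max_samples` and `bootstrap=True` can be sketched with plain NumPy. This is only an illustration of the idea, not scikit-learn's internal code; the seed and the names `n_draw` and `bootstrap_idx` are assumptions made for the example.

```python
import numpy as np

n_train = 8000        # rows in X_train (80% of 10,000)
max_samples = 0.25    # same fraction as in the BaggingClassifier above

# Number of rows handed to each base model
n_draw = int(max_samples * n_train)

# Illustrative seed; this is not the random stream scikit-learn uses internally
rng = np.random.default_rng(42)

# bootstrap=True: row indices are drawn with replacement,
# so the same index can be picked more than once
bootstrap_idx = rng.integers(0, n_train, size=n_draw)

print(n_draw)               # 2000
print(bootstrap_idx.shape)  # (2000,)
```

Each of the 500 base models would get a separate draw like `bootstrap_idx`.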


In [75]:
bag.fit(X_train,y_train)


In [76]:
y_pred = bag.predict(X_test)

In [77]:
print("Bagging using DT",accuracy_score(y_test,y_pred))

Bagging using DT 0.949


- The accuracy of bagging is greater than that of a single decision tree.
- We can also see how many rows our base model got by writing the following code:
  - `bag` is our object.
  - `bag.estimators_samples_` is a list of NumPy arrays, one per base model. Each array holds the row indices that the corresponding base model received.
  - The first array belongs to the first decision tree in the ensemble and lists the rows it received by index.
  - The output shows 2000, meaning our first base model received 2,000 rows.
  - Similarly, `bag.estimators_features_[0]` holds the feature indices given to the first base model. It has 10 entries because no column sampling is applied here.


In [78]:
bag.estimators_samples_[0].shape

(2000,)

In [79]:
bag.estimators_features_[0].shape

(10,)

## Bagging using SVM

In [80]:
bag = BaggingClassifier(
    estimator=SVC(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    random_state=42
)

In [81]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Bagging using SVM",accuracy_score(y_test,y_pred))

Bagging using SVM 0.8985


- With SVM as the base estimator, accuracy is 89.85%, so decision trees gave better performance here.

## Pasting

In [82]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=False,   # sample rows without replacement (pasting)
    random_state=42,
    verbose=1,  # verbosity level; 1 prints progress messages
    n_jobs=-1  # use all available CPU cores for parallel processing
)



In [83]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Pasting classifier using DT",accuracy_score(y_test,y_pred))

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    5.6s remaining:   17.0s
[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    5.8s finished
[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   2 out of   8 | elapsed:    0.1s remaining:    0.4s


Pasting classifier using DT 0.9495


[Parallel(n_jobs=8)]: Done   8 out of   8 | elapsed:    0.2s finished


- The messages indicate that the computation is using 8 CPU cores in parallel, managed by the `LokyBackend`. The progress messages show the completion status and estimated time for each set of tasks.
- A core is a processing unit within your computer's CPU. Modern CPUs have multiple cores, which allow them to handle multiple tasks simultaneously, improving performance for parallelizable operations.

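- In this case, pasting with decision trees (0.9495) scores marginally higher than bagging with decision trees (0.949). On 2,000 test rows that is a difference of a single prediction, so the two are effectively tied.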
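
The only difference between bagging and pasting is whether rows are drawn with or without replacement. A minimal NumPy sketch of the two sampling modes follows; the seed and variable names are assumptions for illustration and do not reproduce scikit-learn's internal draws.

```python
import numpy as np

n_train, n_draw = 8000, 2000
rng = np.random.default_rng(0)   # illustrative seed

# Bagging: with replacement, so an index may appear several times
bagging_idx = rng.choice(n_train, size=n_draw, replace=True)

# Pasting: without replacement, so every index is distinct
pasting_idx = rng.choice(n_train, size=n_draw, replace=False)

print(np.unique(pasting_idx).size)   # 2000
print(np.unique(bagging_idx).size)   # distinct rows in the bagging draw; at most 2000
```

With pasting, each base model therefore always sees 2,000 distinct rows, while a bagging sample of the same size usually contains fewer distinct rows.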


## Random Subspaces

In [84]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=1.0,  # use 100% of the training rows, i.e., all 8,000
    bootstrap=False,  # no row resampling
    max_features=0.5,  # each base model gets 5 feature indices out of 10
    bootstrap_features=True,  # features are drawn with replacement, so an index can repeat
    random_state=42
)


In [85]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Random Subspaces classifier",accuracy_score(y_test,y_pred))

Random Subspaces classifier 0.925


In [86]:
bag.estimators_samples_[0].shape
# will get 8000 since we are sending all rows

(8000,)

In [87]:
bag.estimators_features_[0].shape


(5,)

## Random Patches

In [88]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    max_features=0.5,
    bootstrap_features=True,
    random_state=42
)
# this configuration samples both rows and columns

In [89]:
bag.fit(X_train,y_train)
y_pred = bag.predict(X_test)
print("Random Patches classifier",accuracy_score(y_test,y_pred))

Random Patches classifier 0.918


## OOB Score

- **Out-of-Bag Samples**: When we perform row sampling with replacement, each decision tree misses some rows of the training data, while other rows can appear more than once in the same tree's sample.

- Statistically, when each bootstrap sample draws as many rows as the training set contains, a tree sees about 63% of the distinct rows and the remaining 37% or so are out-of-bag for that tree. With `max_samples=0.25`, as used here, roughly 78% of the rows are out-of-bag for each tree. A row is out-of-bag relative to a particular tree; it is usually still used for training other trees in the ensemble.

- We can use these out-of-bag samples to evaluate the model: each training row is predicted using only the trees that did not train on it. To enable this, we set the `oob_score` parameter to `True`.

- After training the BaggingClassifier with `oob_score=True`, we can access the out-of-bag score using `bag.oob_score_` to get the accuracy on these out-of-bag samples.
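
The out-of-bag fractions can be checked with a short calculation. A given row is missed by one draw with probability `1 - 1/n`, so it is missed by all `m` draws with probability `(1 - 1/n) ** m`. The sketch below evaluates this for the sizes used in this notebook; the variable names are chosen for the example.

```python
n_train = 8000   # rows in the training set
n_draw = 2000    # rows drawn per tree with max_samples=0.25

# Probability that a given row is never drawn for one tree (max_samples=0.25)
p_oob_quarter = (1 - 1 / n_train) ** n_draw

# Same probability when the sample is as large as the training set (max_samples=1.0)
p_oob_full = (1 - 1 / n_train) ** n_train

print(round(p_oob_quarter, 3))   # 0.779
print(round(p_oob_full, 3))      # 0.368
```

The second value is the familiar "about 37%" figure, which applies to full-size bootstrap samples only.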


In [90]:
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=0.25,
    bootstrap=True,
    oob_score=True,
    random_state=42
)

In [91]:
bag.fit(X_train,y_train)

In [92]:
bag.oob_score_

0.943125

In [93]:
y_pred = bag.predict(X_test)
print("Accuracy",accuracy_score(y_test,y_pred))

Accuracy 0.949


## Summary

- **Bagging vs. Pasting**: Bagging is often reported to give better results than pasting, although on this dataset the two were effectively tied.

- **Row Sampling**: As a rule of thumb, sampling 25% to 50% of the rows often works well.

- **High Dimensional Data**: For datasets with high dimensions, such as images or text, using Random Patches and Random Subspaces is recommended.

- **Hyperparameter Tuning**: To determine the best hyperparameters, techniques like `GridSearchCV` or `RandomizedSearchCV` can be employed.


## Applying GridSearchCV


In [94]:
from sklearn.model_selection import GridSearchCV

In [95]:
# parameters = {
#     'n_estimators': [50,100,500], 
#     'max_samples': [0.1,0.4,0.7,1.0],
#     'bootstrap' : [True,False],
#     'max_features' : [0.1,0.4,0.7,1.0]
#     }

In [98]:
# search = GridSearchCV(BaggingClassifier(), parameters, cv=5)

In [97]:
# search.fit(X_train,y_train)
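
The full grid above covers 96 parameter combinations with 5-fold cross-validation, which is slow to run. A lighter alternative is `RandomizedSearchCV`, which tries only a fixed number of randomly chosen combinations. The sketch below is a scaled-down illustration: the dataset, the candidate values, `n_iter=5`, and `cv=3` are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small hypothetical dataset, only to keep the example fast
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=0
)

# Candidate values; lists are sampled uniformly at random
param_distributions = {
    "n_estimators": [10, 20],
    "max_samples": [0.25, 0.5, 1.0],
    "bootstrap": [True, False],
    "max_features": [0.4, 0.7, 1.0],
}

search = RandomizedSearchCV(
    BaggingClassifier(random_state=42),  # default base estimator is a decision tree
    param_distributions,
    n_iter=5,          # evaluate 5 random combinations instead of the full grid
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```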

## Classifier Performance Summary

| **Classifier**              | **Accuracy** |
|-----------------------------|--------------|
| Decision Tree (DT)          | 0.917        |
| Bagging with DT             | 0.949        |
| Bagging with SVM            | 0.8985       |
| Pasting with DT             | 0.9495       |
| Random Subspaces with DT    | 0.925        |
| Random Patches with DT      | 0.918        |
| Out-of-Bag (OOB) Score (DT) | 0.943125     |
| Test Accuracy (OOB DT)      | 0.949        |


## Conclusion

Based on the accuracy results obtained for various classifiers and methods on our dataset:

- **Decision Tree (DT)**: Achieved an accuracy of 0.917. This serves as our baseline performance for comparison.
- **Bagging with DT**: Improved accuracy to 0.949, a gain of about 3 percentage points over the single decision tree.
- **Bagging with SVM**: Delivered an accuracy of 0.8985, lower than both the bagged decision trees and the single decision tree. No single SVM was trained here, so we cannot say whether bagging helped the SVM.
- **Pasting with DT**: Scored 0.9495, marginally above bagging with DT (0.949). The gap is one test prediction out of 2,000, so the two methods are effectively tied on this dataset.
- **Random Subspaces with DT**: Achieved an accuracy of 0.925, slightly above the baseline but below bagging and pasting.
- **Random Patches with DT**: Provided an accuracy of 0.918, essentially the same as the single-tree baseline (0.917), so combining row and column sampling did not help here.
- **Out-of-Bag (OOB) Score with DT**: Recorded an OOB score of 0.943125, which is close to the test accuracy of 0.949. This indicates that the OOB score is a reasonable estimate of test performance without needing a separate validation set.

**Summary**: For our dataset, bagging and pasting with decision trees performed best and were practically indistinguishable, both improving clearly on the single-tree baseline. Random subspaces gave a small gain over the baseline, while random patches and bagging with SVM did not beat it. The OOB score provided a reliable estimate of model performance. Overall, row sampling with decision trees, with or without replacement, was the most effective approach on this dataset.
