# Ensemble Methods (Work-in-Progress)

Ensemble methods combine multiple models to improve predictive performance. There are two main approaches:
- **Averaging and Voting**: take the average predictions of different models or hold a majority vote.
- **Boosting**: iteratively train additional models, targeting samples that have not been predicted correctly in previous steps.

<!--https://towardsdatascience.com/lightgbm-vs-xgboost-which-algorithm-win-the-race-1ff7dd4917d-->

In [30]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

imdb_train = pd.read_csv("../Data/imdb_train.csv",
                         names=['label','text'])
imdb_test = pd.read_csv("../Data/imdb_test.csv",
                         names=['label','text'])

# Target
y_train = imdb_train['label']
y_test = imdb_test['label']

# Features
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(imdb_train['text'])
X_test = vectorizer.transform(imdb_test['text'])

In [135]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=0.11,max_iter=1000)
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))

0.999
0.795


In [133]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=25)
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))

0.658
0.604


In [105]:
from sklearn.svm import LinearSVC

model = LinearSVC(C=0.008)
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))

0.999
0.791


## A. Averaging Methods

Averaging methods take the average prediction of several models or hold a majority vote.
Their appeal is that they work with any model, including highly complex ones. The downside is that they take no additional steps to address samples on which the ensemble performs poorly.


### Regression - Averaging

For regression tasks, a simple way to construct an ensemble is to average the predictions of multiple models.

You can combine any number of models with `sklearn.ensemble.VotingRegressor`. Simply provide a list of `(name, model)` tuples when you create the ensemble:
```python
VotingRegressor(estimators=[
    (name_1, model_1),
    (name_2, model_2),
    ...
    ],
    weights=weights)
```
`weights` is an optional list specifying the weight each model should carry. By default, all models are weighted equally.
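
As a minimal sketch of the call above, the following fits a `VotingRegressor` on synthetic data. The dataset, the three base models, and the weights are assumptions chosen only for illustration; they are not part of the IMDB example used elsewhere in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used purely for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

reg_ensemble = VotingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
        ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
    ],
    weights=[2, 1, 1],  # illustrative weights; omit for equal weighting
)
reg_ensemble.fit(X_tr, y_tr)

# For regressors, score() returns R^2 on the given data
print(reg_ensemble.score(X_tr, y_tr))
print(reg_ensemble.score(X_te, y_te))
```

The ensemble's prediction is the weighted average of the three models' predictions, so a model with a larger weight pulls the result further toward its own output.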



In [None]:
from sklearn.ensemble import VotingRegressor




### Classification - Voting

For classification, a simple way to construct an ensemble is to have multiple models vote for the prediction. There are two main ways to do this:
- **Majority voting**: use the most common predicted label among the included models as the final prediction.
- **Soft voting**: average the predicted *probabilities* of the included models and predict the class with the highest average.
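
The two schemes can disagree. The sketch below uses made-up probabilities for a single sample from three hypothetical models in a binary task: two models lean slightly negative, while one is confidently positive.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class for ONE sample,
# one entry per model (made-up numbers for illustration)
proba_positive = np.array([0.45, 0.45, 0.90])

# Majority (hard) voting: each model casts a vote for its own predicted label
votes = (proba_positive >= 0.5).astype(int)          # [0, 0, 1]
hard_prediction = int(votes.sum() > len(votes) / 2)  # 1 vote out of 3 is not a majority

# Soft voting: average the probabilities first, then threshold
mean_proba = proba_positive.mean()                   # about 0.6
soft_prediction = int(mean_proba >= 0.5)

print(hard_prediction, soft_prediction)  # 0 1
```

Majority voting ignores how confident each model is, whereas soft voting lets one confident model outweigh two uncertain ones.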

As with regression, you can combine any number of models with `sklearn.ensemble.VotingClassifier`. Simply provide a list of `(name, model)` tuples when you create the ensemble:
```python
VotingClassifier(estimators=[
    (name_1, model_1),
    (name_2, model_2),
    ...
    ],
    voting=voting, weights=weights)
```

Besides setting `weights`, for classification you can also specify the voting method. Set `voting` to `'hard'` for majority voting and `'soft'` for soft voting. The default is `'hard'`. Soft voting requires every estimator to implement `predict_proba`, which `LinearSVC`, for example, does not.
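
A minimal sketch of soft voting with weights follows. The synthetic dataset, the choice of base models (all of which implement `predict_proba`), and the weights are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

soft_ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=25)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",      # average predicted probabilities instead of counting votes
    weights=[2, 1, 1],  # illustrative weights; omit for equal weighting
)
soft_ensemble.fit(X_tr, y_tr)

print(soft_ensemble.score(X_te, y_te))
# With voting="soft" the ensemble also exposes the averaged probabilities
sample_proba = soft_ensemble.predict_proba(X_te[:3])
print(sample_proba)
```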


In [136]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

model_1 = LogisticRegression(C=0.11,max_iter=1000)
model_2 = KNeighborsClassifier(n_neighbors=25)
model_3 = LinearSVC(C=0.008)

ensemble = VotingClassifier(estimators=[
                            ('logit', model_1),    
                            ('knn', model_2),
                            ('svc',model_3)]
                           )
ensemble.fit(X_train,y_train)
print(ensemble.score(X_train,y_train))
print(ensemble.score(X_test,y_test))


0.999
0.795


### Bagging

Instead of combining different model classes, we can train multiple copies of the same model on different samples of the data. The typical way to do this is to *bootstrap* the training data, that is, to resample it with replacement.

The scikit-learn classes for bagging are `sklearn.ensemble.BaggingRegressor` and `sklearn.ensemble.BaggingClassifier`.
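
To make the resampling step concrete, here is a small NumPy sketch of drawing one bootstrap sample of row indices. It illustrates the idea only and is not how scikit-learn implements bagging internally; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
indices = np.arange(n_samples)

# Draw n_samples row indices WITH replacement: one bootstrap sample.
# Some rows appear more than once, others not at all.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Rows that were never drawn are "out-of-bag" for this particular sample
out_of_bag = np.setdiff1d(indices, bootstrap_idx)

print(np.sort(bootstrap_idx))
print(out_of_bag)
```

Each copy of the model in a bagging ensemble would be trained on its own such sample, so the copies see slightly different data and make partly different errors.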

In [195]:
from sklearn.ensemble import BaggingClassifier
# `estimator` replaced the `base_estimator` argument in scikit-learn 1.2
ensemble = BaggingClassifier(estimator=model_1,n_jobs=4)
ensemble.fit(X_train,y_train)
print(ensemble.score(X_train,y_train))
print(ensemble.score(X_test,y_test))

0.992
0.782


### Random Forest

A particularly important averaging model is the random forest, an ensemble of decision trees. As we have seen previously, decision trees are prone to overfitting because they can partition the data down to individual samples. Random forests mitigate this problem in several ways:
1. Use soft voting from a large number of trees.
2. Train each tree on a bootstrapped sample of the original data.
3. Consider only a random subset of the features at each split, which makes the trees less correlated with one another.
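
The sketch below compares a single unrestricted decision tree with a random forest on synthetic data, and checks that the forest's predicted probabilities match the average over its individual trees. The dataset and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A single tree with no depth limit can memorize the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# A forest of bootstrapped trees (bootstrap=True is the default)
forest = RandomForestClassifier(n_estimators=200, bootstrap=True,
                                random_state=0).fit(X_tr, y_tr)

for name, clf in [("single tree", tree), ("random forest", forest)]:
    print(name, clf.score(X_tr, y_tr), clf.score(X_te, y_te))

# Soft voting: the forest's probabilities are the mean of its trees' probabilities
mean_tree_proba = np.mean([t.predict_proba(X_te) for t in forest.estimators_],
                          axis=0)
print(np.allclose(forest.predict_proba(X_te), mean_tree_proba))
```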

In [158]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_jobs=4)
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))

1.0
0.783


## B. Boosting Methods
Boosting methods iteratively train additional models, targeting samples that were not predicted correctly in previous steps. The models used in boosting are typically simple, or *weak*, such as decision trees of limited depth. The power of these methods comes from combining many weak models through targeted training.
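
To illustrate the idea, here is a bare-bones sketch of gradient boosting for regression with squared error, where each new depth-1 tree is fit to the residuals of the current ensemble. It is a simplification under stated assumptions (synthetic data, fixed learning rate, no regularization or early stopping) and not how the libraries below are implemented.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem, used purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.35
n_rounds = 50

# Start from a constant model that predicts the mean of the target
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]
stumps = []

for _ in range(n_rounds):
    residuals = y - prediction                   # what the ensemble still gets wrong
    stump = DecisionTreeRegressor(max_depth=1)   # weak learner: a single split
    stump.fit(X, residuals)
    prediction += learning_rate * stump.predict(X)  # take a small corrective step
    stumps.append(stump)
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Each stump on its own is a poor model, but because every round focuses on the remaining errors, the training error of the combined model shrinks as rounds accumulate.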

Among the various boosting methods, gradient boosting is the most popular, to the point that it is often the first method to try before turning to artificial neural networks. There are four main gradient boosting implementations:
- Scikit-learn's `GradientBoostingClassifier`: a simple baseline implementation, rarely used in practice.
- `xgboost`: the best known.
- `lightgbm`: typically the fastest to train.
- `catboost`: good defaults.
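
For completeness, here is a minimal sketch of the scikit-learn baseline from the list above. It runs on synthetic data rather than the IMDB features, and the hyperparameters simply mirror the ones used with the other libraries in this section; both choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_depth=1 makes every tree a stump, i.e. a weak learner
gb_model = GradientBoostingClassifier(max_depth=1, learning_rate=0.35,
                                      n_estimators=100, random_state=0)
gb_model.fit(X_tr, y_tr)
print(gb_model.score(X_tr, y_tr))
print(gb_model.score(X_te, y_te))
```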


In [188]:
from xgboost import XGBClassifier

model = XGBClassifier(max_depth=1,learning_rate=0.35,n_jobs=4)
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))

0.886
0.775


In [189]:
from lightgbm import LGBMClassifier

model = LGBMClassifier(max_depth=1,learning_rate=0.35,n_jobs=4)
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))


In [192]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(max_depth=1,learning_rate=0.35,thread_count=4)
model.fit(X_train,y_train,verbose=False)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))

1.0
0.784
