# Ensemble Learning

Random forests are just one example of ensemble learning, which simply means using multiple models to solve the same problem and letting them vote on the result.

- Random forest uses *bagging* to implement ensemble learning
    - many models are built by training on randomly drawn subsets of the data
- *Boosting* is an alternative technique where each subsequent model in the ensemble boosts attributes that address data misclassified by the previous model
- A *bucket of models* trains several different models on the training data and picks the one that works best with the test data
- *Stacking* runs multiple models at once on the data and combines the results
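
As a rough illustration of bagging and voting, the sketch below trains several tiny models on bootstrap samples (random draws with replacement) and lets them vote. It is a minimal stdlib-only toy, not how random forest is implemented; the helper names (`bootstrap_sample`, `train_stump`, `majority_vote`) and the one-dimensional data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]

def train_stump(rows):
    """Hypothetical weak learner: threshold halfway between the class means.

    Assumes class 1 tends to have larger x values than class 0.
    """
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    if not xs0 or not xs1:
        # The sample contained a single class: fall back to a constant model.
        only = 0 if xs0 else 1
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > threshold)

def majority_vote(models, x):
    """Let every model predict and return the most common answer."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Made-up (feature, label) pairs, for illustration only.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]

rng = random.Random(0)
# An odd number of models avoids tied votes in a two-class problem.
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(11)]
ensemble_predictions = [majority_vote(models, x) for x, _ in data]
print(ensemble_predictions)
```

Each model sees a slightly different sample, so individual thresholds differ; the vote smooths out those differences.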

## Advanced Ensemble Learning
- Bayes Optimal Classifier
- Bayesian Parameter Averaging
- Bayesian Model Combination

## XGBoost
- eXtreme Gradient Boosted Trees
- Each tree boosts attributes that led to misclassifications by previous trees
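
The idea that each tree corrects the errors of the previous ones can be sketched with plain gradient boosting on squared error, where every new one-split tree is fitted to the residuals left by the ensemble so far. This is a simplified stand-in and not XGBoost's actual algorithm (which, among other things, adds regularization); the function names and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    # Dropping the largest x keeps the right-hand side of every split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds, eta):
    """Add one stump per round, each fitted to the current residuals."""
    prediction = [0.0] * len(xs)
    trees, losses = [], []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, prediction)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # eta (the learning rate) shrinks each tree's contribution.
        prediction = [p + eta * tree(x) for x, p in zip(xs, prediction)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, prediction)) / len(xs))
    return trees, losses

# Made-up regression data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

trees, losses = boost(xs, ys, rounds=10, eta=0.3)
for round_number, loss in enumerate(losses, start=1):
    print(round_number, round(loss, 4))
```

The training loss shrinks every round because each stump targets what the earlier stumps got wrong; a smaller `eta` makes each step more cautious.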

## Features of XGBoost
- Regularized boosting (prevent overfitting)
- Can handle missing values automatically
- Parallel processing
- Can cross-validate at each iteration
  - Enables early stopping, finding optimal number of iterations
- Incremental training
- Can plug in your own optimization objectives
- Tree pruning
  - Generally results in deeper, but optimized, trees
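
The early-stopping idea in the list above can be sketched as a generic patience rule: keep boosting while the validation loss improves, and stop once it has failed to improve for a set number of rounds. This is an illustration of the concept, not XGBoost's internal code; the function name `best_iteration`, the `patience` parameter, and the loss values are hypothetical.

```python
def best_iteration(validation_losses, patience):
    """Return (best_index, stop_index) under a simple patience rule.

    Training stops at the first round that is `patience` rounds past
    the best loss seen so far.
    """
    best_index = 0
    for i, loss in enumerate(validation_losses):
        if loss < validation_losses[best_index]:
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    # The loss kept improving (or the list ended) before patience ran out.
    return best_index, len(validation_losses) - 1

# Made-up validation losses, one per boosting round.
losses = [0.90, 0.60, 0.45, 0.40, 0.41, 0.43, 0.42, 0.44]
best, stopped = best_iteration(losses, patience=3)
print(best, stopped)
```

Here the loss bottoms out at index 3, so that round count is the one to keep, and training halts three rounds later without using the remaining iterations.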
 
## XGBoost Hyperparameters
- Booster
    - gbtree or gblinear
- Objective (multi:softmax, multi:softprob)
- Eta (learning rate - Adjusts weights on each step)
- Max_depth (depth of the tree)
- Min_child_weight
    - Can control overfitting, but too high a value will underfit
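
The two objectives listed above differ only in what they return: `multi:softprob` yields one probability per class, while `multi:softmax` yields the single most likely class. The sketch below shows that relationship with a hand-written softmax; the raw scores are hypothetical and the code does not call XGBoost.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical raw scores for one sample across three classes.
raw_scores = [0.2, 2.1, -0.5]

# softprob-style output: a probability for every class.
probabilities = softmax(raw_scores)

# softmax-style output: the index of the most probable class.
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print([round(p, 3) for p in probabilities])
print(predicted_class)
```

Taking the argmax of the probabilities gives the discrete label, which is why the `multi:softmax` predictions elsewhere in these notes print as class indices such as `2.` or `0.`.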


# Code

In [2]:
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Get the number of samples and the number of features
numSamples, numFeatures = iris.data.shape

# Display dataset information
print(numSamples)
print(numFeatures)
print(list(iris.target_names))

150
4
['setosa', 'versicolor', 'virginica']


In [3]:
# Import the train/test split helper
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=0)

In [4]:
# Import xgboost
import xgboost as xgb

# Create the DMatrix for the training and test sets with the data and labels
train = xgb.DMatrix(X_train, label=y_train)
test = xgb.DMatrix(X_test, label=y_test)

In [5]:
# Hyperparameter definitions for the XGBoost model
param = {
    'max_depth': 4,                # Maximum depth of each decision tree
    'eta': 0.3,                    # Learning rate
    'objective': 'multi:softmax',  # Problem type (multiclass classification with discrete output)
    'num_class': 3                 # Number of classes (3 for the Iris dataset)
}
# Number of epochs (training iterations)
epochs = 10

In [6]:
# Train the model
model = xgb.train(param, train, epochs)

In [7]:
# Make predictions on the test set
predictions = model.predict(test)

In [8]:
# Print the predictions
print(predictions)

[2. 1. 0. 2. 0. 2. 0. 1. 1. 1. 2. 1. 1. 1. 1. 0. 1. 1. 0. 0. 2. 1. 0. 0.
 2. 0. 0. 1. 1. 0.]


In [9]:
# Measure the model's accuracy
from sklearn.metrics import accuracy_score

accuracy_score(y_test, predictions)

1.0

## Activity

In [10]:
# Half the epochs
param = {
    'max_depth': 4,
    'eta': 0.3,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 5

model = xgb.train(param, train, epochs)

predictions = model.predict(test)
accuracy_score(y_test, predictions)

1.0

In [11]:
# Half the epochs and half the learning rate
param = {
    'max_depth': 4,
    'eta': 0.15,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 5

model = xgb.train(param, train, epochs)

predictions = model.predict(test)
accuracy_score(y_test, predictions)

1.0

In [12]:
# Half the epochs, half the learning rate,
# and half the maximum depth
param = {
    'max_depth': 2,
    'eta': 0.15,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 5

model = xgb.train(param, train, epochs)

predictions = model.predict(test)
accuracy_score(y_test, predictions)

0.9666666666666667

In [18]:
# 1 epoch and 1/1000 of the learning rate
param = {
    'max_depth': 4,
    'eta': 0.0003,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 1

model = xgb.train(param, train, epochs)

predictions = model.predict(test)
accuracy_score(y_test, predictions)

1.0

In [36]:
# Learning rate set to 0.00000003
param = {
    'max_depth': 4,
    'eta': 0.00000003,
    'objective': 'multi:softmax',
    'num_class': 3} 
epochs = 10

model = xgb.train(param, train, epochs)

predictions = model.predict(test)
accuracy_score(y_test, predictions)

0.9333333333333333
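
One way to read the accuracy figures in these experiments is as counts of correct predictions. The sketch below assumes the 30-sample test set implied by `test_size=0.2` on 150 rows (the printed prediction array also has 30 entries); the helper name `correct_count` is hypothetical.

```python
num_samples = 150     # from iris.data.shape
test_fraction = 0.2   # test_size passed to train_test_split
num_test = round(num_samples * test_fraction)

def correct_count(accuracy, total):
    """Recover how many predictions were right from an accuracy score."""
    return round(accuracy * total)

# Accuracy values reported by the runs in these notes.
reported = [1.0, 0.9666666666666667, 0.9333333333333333]
counts = [correct_count(accuracy, num_test) for accuracy in reported]

for accuracy, count in zip(reported, counts):
    print(f"{accuracy:.4f} -> {count}/{num_test} correct, {num_test - count} wrong")
```

With a test set this small, a single misclassified flower moves the score by 1/30, so the gaps between the runs amount to one or two samples.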