# Ensembling exercises
In this exercise you will make predictions on the Pima Indians diabetes dataset. It is a classification problem in which we try to predict 1 (diabetic) or 0 (not diabetic). All variables are numeric.

## 1. Load the libraries you consider common to the whole notebook

In [1]:
import pandas as pd
import numpy as np

## 2. Read the data from [this address](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
The column names are:
```Python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
```

In [3]:
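import ssl
# Workaround for certificate errors during the download: this disables HTTPS certificate verification for the whole session
ssl._create_default_https_context = ssl._create_unverified_context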

In [4]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)
df

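|     | preg | plas | pres | skin | test | mass | pedi  | age | class |
|-----|------|------|------|------|------|------|-------|-----|-------|
| 0   | 6    | 148  | 72   | 35   | 0    | 33.6 | 0.627 | 50  | 1     |
| 1   | 1    | 85   | 66   | 29   | 0    | 26.6 | 0.351 | 31  | 0     |
| 2   | 8    | 183  | 64   | 0    | 0    | 23.3 | 0.672 | 32  | 1     |
| 3   | 1    | 89   | 66   | 23   | 94   | 28.1 | 0.167 | 21  | 0     |
| 4   | 0    | 137  | 40   | 35   | 168  | 43.1 | 2.288 | 33  | 1     |
| ... | ...  | ...  | ...  | ...  | ...  | ...  | ...   | ... | ...   |
| 763 | 10   | 101  | 76   | 48   | 180  | 32.9 | 0.171 | 63  | 0     |
| 764 | 2    | 122  | 70   | 27   | 0    | 36.8 | 0.340 | 27  | 0     |
| 765 | 5    | 121  | 72   | 23   | 112  | 26.2 | 0.245 | 30  | 0     |
| 766 | 1    | 126  | 60   | 0    | 0    | 30.1 | 0.349 | 47  | 1     |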


## 3. Bagging
For this section you will build an ensemble using bagging ([BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)), combining 100 [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) estimators. Remember to also use [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with 10 folds.

**For this section and the following ones, there is no need to split into train/test**, to keep things simple. Just split your data into features and target.

Set a seed.

In [5]:
array = df.values
X = array[:,0:8]
Y = array[:,8]
seed = 7

In [7]:
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

kfold = model_selection.KFold(n_splits=10)
cart = DecisionTreeClassifier()
num_trees = 100
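# The base tree is passed positionally: the keyword was renamed from `base_estimator` to `estimator` in newer scikit-learn releases
model = BaggingClassifier(cart, n_estimators=num_trees, random_state=seed)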
results_bagg = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
results_bagg

0.7720437457279563
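
To make the bagging idea more concrete, here is a minimal pure-Python sketch of what the ensemble does conceptually: draw bootstrap samples, fit one base learner per sample, and predict by majority vote. It is not how `BaggingClassifier` is implemented. The toy one-feature data, the threshold "stump" learner and the helper names (`bootstrap_sample`, `fit_stump`, `bagging_predict`) are all assumptions made up for this illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Toy base learner: choose the threshold and orientation with the fewest
    training errors. Each row is (x, label) with label in {0, 1}."""
    best = None
    for threshold, _ in rows:
        for above in (0, 1):  # class predicted when x > threshold
            errors = sum(
                (above if x > threshold else 1 - above) != y for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x > threshold else 1 - above


def bagging_predict(models, x):
    """Majority vote over the predictions of all base learners."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(7)  # fixed seed, mirroring `seed = 7` in the notebook
# Hypothetical 1-D data: label is 1 when x > 5, otherwise 0.
data = [(x, int(x > 5)) for x in range(11)]
# An odd number of learners avoids ties in the binary vote.
models = [fit_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagging_predict(models, 1), bagging_predict(models, 9))
```

Each stump sees a slightly different resample of the data, so individual thresholds vary; the vote averages that variation out, which is the variance-reduction effect bagging relies on.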

## 4. Random Forest
In this case, train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees and a `max_features` of 3, also with cross validation.

In [8]:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

num_trees = 100
max_features = 3
kfold = model_selection.KFold(n_splits=10)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results_rf = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
results_rf

0.775974025974026
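
As a rough illustration of what `max_features=3` means, the sketch below draws the random subset of candidate features that a tree would consider at a single split. It only mimics the sampling step, not the split search itself, and the helper name `candidate_features` is hypothetical.

```python
import random

# The eight feature columns of the dataset (the target 'class' is excluded).
FEATURES = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']


def candidate_features(feature_names, max_features, rng):
    """Return the random subset of features considered at one split."""
    return rng.sample(feature_names, max_features)


rng = random.Random(7)
for split in range(3):
    # A fresh subset is drawn at every split, so trees end up less correlated.
    print(split, candidate_features(FEATURES, 3, rng))
```

Restricting each split to a few features is what distinguishes a random forest from plain bagging of decision trees, where every split may look at all eight features.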

## 5. AdaBoost
Implement an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with 30 trees.

In [9]:
# AdaBoost Classification
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

num_trees = 30
kfold = model_selection.KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results_ada = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
results_ada

0.760457963089542

The score of 0.76 is the cross-validated accuracy of the `AdaBoostClassifier`: averaged over the folds, the model classified about 76% of the held-out examples correctly.

Specifically:

`AdaBoostClassifier` is an ensemble method that combines many weak classifiers (shallow decision trees by default) into a stronger one.

The model was trained on the features (`X`) and labels (`Y`) defined earlier.

The data was split into 10 folds. In each round the model was trained on 9 folds and tested on the remaining one, so every fold served as the test set exactly once.

The accuracy was computed on each fold, and the 10 values were averaged to give the overall score of 0.76.

Because every example is scored by a model that did not see it during training, this is a more realistic estimate of performance on unseen data than training accuracy. It is still only an estimate: the folds here are neither shuffled nor stratified, and the dataset is small. Whether 76% is acceptable depends on the specific use case and requirements.
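
The fold mechanics described above can be sketched in plain Python. This is a simplified stand-in for an unshuffled `KFold`, not scikit-learn's code; the helper names `kfold_indices` and `mean_score` are hypothetical, and the row count of 768 is an assumption about the size of the dataset.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) for consecutive, unshuffled folds.

    The first n_samples % n_splits folds get one extra sample, which is
    how an unshuffled KFold is documented to size its folds.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def mean_score(fold_scores):
    """Average the per-fold accuracies into one cross-validated score."""
    return sum(fold_scores) / len(fold_scores)


# Assumed dataset size: 768 rows, split into 10 folds.
fold_sizes = [len(test_idx) for _, test_idx in kfold_indices(768, 10)]
print(fold_sizes)
```

In the notebook, `cross_val_score` fits a fresh copy of the model on each `train_idx` and scores it on the matching `test_idx`; calling `.mean()` on the result plays the role of `mean_score`.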



 

## 6. GradientBoosting
Implement a [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) with 100 estimators.

The model in the next cell is configured with two arguments:

`n_estimators` sets the number of decision trees in the ensemble. It is passed the `num_trees` variable, which is set to 100.

`random_state` sets the random seed, so repeated runs on the same data give the same results.

`GradientBoostingClassifier` trains its trees sequentially, each one fitted to correct the errors of the ensemble built so far. A higher `n_estimators` adds more trees, which can improve accuracy at the cost of longer training time.

Instantiating the class only configures the model. It is trained when `cross_val_score` fits it on the training folds, and a fitted model predicts by combining the outputs of its trees.

In [10]:
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier

num_trees = 100
kfold = model_selection.KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results_gb = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
results_gb

0.7681989063568012
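
The sequential idea described in this section (each tree learning from the errors of the ensemble so far) can be sketched with a simplified, pure-Python boosting loop. This is an assumption-laden toy: it does regression with squared-error loss and one-feature stumps, whereas `GradientBoostingClassifier` optimises a classification loss with real regression trees. The data and the helper names (`fit_regression_stump`, `gradient_boost`, `mse`) are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Find the 1-D split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=100, learning_rate=0.1):
    """For squared error, the negative gradient is the residual, so each
    stump is fitted to what the current ensemble still gets wrong."""
    baseline = sum(ys) / len(ys)
    stumps = []
    predictions = [baseline] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: baseline + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: a step-like target on a single feature.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 2.0, 2.1]
few = gradient_boost(xs, ys, n_estimators=5)
many = gradient_boost(xs, ys, n_estimators=100)
print(mse(few, xs, ys), mse(many, xs, ys))
```

On this toy data the training error shrinks as stumps are added, which is the effect of raising `n_estimators`. Training error alone says nothing about generalisation, which is why the notebook evaluates with cross validation.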

## 7. XGBoost
For this section use an [XGBoostClassifier](https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html) with 100 estimators. XGBoost is not part of scikit-learn's suite of models, so you will have to install it with `pip install xgboost`.

XGBClassifier

`XGBClassifier` is the classifier in the XGBoost library. It builds gradient-boosted decision tree models for classification tasks.

Hyperparameters such as `n_estimators` (the number of trees in the ensemble) are set when the model is created. The `.fit()` method then trains the ensemble on the training features (`X`) and labels (`y`).

The trained model predicts class labels for new, unseen data through the `.predict()` method.

Some key aspects of `XGBClassifier`:

It is an ensemble method that combines multiple decision trees to improve accuracy.

It uses gradient boosting: trees are added one at a time, each fitted to the errors of the trees before it.

It handles sparse data well and often gives good accuracy with default parameters.

It is designed for fast training and prediction.

Its built-in regularization helps limit overfitting to the training data.

In summary, it is a gradient-boosted decision tree classifier that can often reach high accuracy without much tuning.





In [11]:
#!pip install xgboost

In [12]:
from sklearn.model_selection import train_test_split

# Split data into features (X) and labels (y)
y = df['class'] 
X = df.drop('class', axis=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define hyperparameter space
param_grid = {'C': [0.1, 1, 10, 100], 
              'gamma': [0.001, 0.01, 0.1, 1]}

# Create grid search object
grid = GridSearchCV(SVC(), param_grid, cv=5)

# Fit on training data
grid.fit(X_train, y_train) 

# View best hyperparameters 
print(grid.best_params_)

# Build model with best hyperparameters
model = SVC(C=grid.best_params_['C'], 
            gamma=grid.best_params_['gamma'])

# Refit on full training set
model.fit(X_train, y_train)
from xgboost import XGBClassifier

kfold = model_selection.KFold(n_splits=10)
model = XGBClassifier(n_estimators=100)
results_xgb = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
print(results_xgb)

0.7395591250854407


## 8. Resultados
Create a Series with the results and their algorithms, sorted from highest to lowest.

In [22]:
resul = [results_bagg, results_rf, results_ada, results_gb, results_xgb]
algori = ["Bagging DT", "Random Forest", "Ada Boost", "GradientBoosting", "XGBoost"]

resultados = pd.Series(resul, index=algori).sort_values(ascending=False)
resultados

Random Forest       0.775974
Bagging DT          0.772044
GradientBoosting    0.768199
Ada Boost           0.760458
XGBoost             0.739559
dtype: float64