# Ensembling exercises
In this exercise you will make predictions on the Pima Indians diabetes dataset. It is a binary classification problem in which we try to predict 1 (diabetic) or 0 (not diabetic). All variables are numeric.

### 1. Load the libraries you consider common to the notebook

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Read the data from [this address](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv)
The column names are:
```Python
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
```

In [3]:
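import ssl
# Workaround for certificate errors when downloading the CSV: this disables
# HTTPS certificate verification for the whole session, so use it with care.
ssl._create_default_https_context = ssl._create_unverified_context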

In [6]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv(url, names=names)
df

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


### 3. Bagging
For this section you will build an ensemble using bagging ([BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)), combining 100 [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) models. Remember to also use [cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) with 10 folds.

**For this and the following sections you do not need to split into train/test**, to keep things simple. Just split your data into features and target.

Set a seed.

In [34]:
seed = 42
array = df.values
X = array[:, 0:8]  # the eight feature columns
Y = array[:, 8]    # the target column ('class')


In [9]:
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


* K-fold cross-validation trains and tests the model several times on different subsets of the data. Averaging over the folds reduces the variance of the performance estimate and makes it more reliable than a single split.

* `model_selection.KFold(n_splits=10)` creates the object that splits the data into 10 folds when the model is evaluated later in the code.
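As a minimal sketch of what the splitter does, the snippet below runs `KFold` over a small toy array (a hypothetical stand-in for `X`, not the diabetes data) and records the size of each train/test split.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 20 samples, 2 features (hypothetical stand-in for X)
X_toy = np.arange(40).reshape(20, 2)

kfold_demo = KFold(n_splits=10)  # no shuffling, as in the notebook
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(X_toy)):
    # Each sample ends up in the test split of exactly one fold
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

Without `shuffle=True` the folds are consecutive blocks of rows, so shuffling may be worth considering if the file is ordered in some meaningful way.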

In [12]:
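kfold = model_selection.KFold(n_splits=10)
dtc = DecisionTreeClassifier()
num_trees = 100  # the bagging ensemble combines 100 decision trees

# The estimator is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2.
model = BaggingClassifier(dtc, n_estimators=num_trees, random_state=seed)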

Bagging = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
Bagging

0.775974025974026
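To make the idea behind bagging concrete, here is a hand-rolled sketch under stated assumptions: it uses synthetic data from `make_classification` rather than the diabetes file, 25 trees instead of 100, and a hard majority vote. `BaggingClassifier` itself averages predicted probabilities when the base estimator provides them, so this is an illustration of the technique, not a reimplementation of the library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem standing in for the diabetes data (assumption)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
votes = np.zeros((n_models, len(X_syn)), dtype=int)

for i in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_syn), size=len(X_syn))
    tree = DecisionTreeClassifier(random_state=i).fit(X_syn[idx], y_syn[idx])
    votes[i] = tree.predict(X_syn)

# Majority vote across the trees (labels are 0/1, n_models is odd)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
train_acc = (bagged_pred == y_syn).mean()
print(f"training accuracy of the bagged vote: {train_acc:.3f}")
```

The accuracy printed here is measured on the training data, so it is optimistic. The cross-validated score above is the number to trust.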

### 4. Random Forest
In this case train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 100 trees and a `max_features` of 3. Again, use cross-validation.

In [15]:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

trees = 100
max_features = 3

# Split data into k folds for cross-validation and initialize random forest classifier
kfold = model_selection.KFold(n_splits=10)
model = RandomForestClassifier(n_estimators=trees, max_features=max_features)


# Cross-validate the random forest classifier model using k-fold 
# Calculate the mean cross-validation score.
rf = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()

rf

0.7616712235133288
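The forest above is created without a `random_state`, so its score will change slightly from run to run. The sketch below, on synthetic data and with a smaller forest (both assumptions made to keep it quick), shows that fixing the seed makes the cross-validated score repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the diabetes features and target (assumption)
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

def rf_score(seed):
    """Mean 5-fold accuracy of a small random forest built with a fixed seed."""
    model = RandomForestClassifier(n_estimators=20, max_features=3, random_state=seed)
    return cross_val_score(model, X_syn, y_syn, cv=KFold(n_splits=5)).mean()

score_a = rf_score(42)
score_b = rf_score(42)
print(score_a == score_b)
```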

### 5. AdaBoost
Implement an [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with 30 estimators.

In [16]:
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

trees = 30
kfold = model_selection.KFold(n_splits=10)

# Initializes an AdaBoostClassifier with the given number of estimators
# and a fixed random state for reproducibility.
model = AdaBoostClassifier(n_estimators=trees, random_state=seed)
# mean accuracy score across the 10 folds of cross-validation.
ada = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
ada

0.760457963089542

 * `cross_val_score` trains a fresh copy of the AdaBoost model on nine folds, evaluates its accuracy on the remaining fold, and repeats this for each of the 10 folds. The 10 scores are then averaged into a single cross-validated accuracy.
 * This estimates how well the model will generalize to new data. Cross-validation does not reduce overfitting by itself, but it exposes it, because every score is computed on data the model did not see during training.
 * The same cross-validated metric can be used to compare and tune settings of the AdaBoostClassifier.

 * **A score of 0.76** means that, averaged over the 10 folds, the AdaBoostClassifier correctly classified 76% of the examples in the held-out fold.

Whether 0.76 is acceptable depends on the use case, and it is only meaningful relative to a baseline such as always predicting the majority class.
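One way to put an accuracy figure in context is to compare it with a trivial baseline. The sketch below uses `DummyClassifier` on a made-up target with 65 negatives and 35 positives (an assumption for illustration, not the real class counts) and evaluates it with the same 10-fold scheme.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Hypothetical imbalanced target: 65 zeros and 35 ones, in random order
y_toy = rng.permutation(np.array([0] * 65 + [1] * 35))
X_toy = rng.normal(size=(100, 8))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during training
baseline_model = DummyClassifier(strategy="most_frequent")
baseline = cross_val_score(baseline_model, X_toy, y_toy, cv=KFold(n_splits=10)).mean()
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

Running the same two lines with the notebook's `X` and `Y` would give the baseline that the ensemble scores should be compared against.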

### 6. GradientBoosting
Implement a [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) with 100 estimators.

The GradientBoostingClassifier combines the predictions of many decision trees. It trains the trees sequentially, each one fit to the errors left by the trees before it. A higher `n_estimators` means more trees, which can improve accuracy at the cost of longer training time and, eventually, a higher risk of overfitting.

The random_state parameter seeds the random number generator used while training the trees. This ensures repeatable results across multiple runs with the same data.
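The sequential idea can be sketched by hand for the simplest case, regression with squared error, where "learning from the errors" means fitting each new tree to the current residuals. The toy data, tree depth, learning rate and number of rounds are all assumptions chosen for illustration. `GradientBoostingClassifier` applies the same scheme to the gradient of a classification loss rather than to plain residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Toy regression problem: noisy sine wave (assumption, not the diabetes data)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant model
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residuals = y_toy - prediction  # errors of the ensemble built so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # small corrective step
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```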

In [23]:
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier


In [25]:

trees = 100
kfold = model_selection.KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=trees, random_state=seed)
gbc = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
gbc

0.7642857142857143

### 7. XGBoost
For this section use an [XGBoostClassifier](https://docs.getml.com/latest/api/getml.predictors.XGBoostClassifier.html) with 100 estimators. XGBoost is not part of scikit-learn's model suite, so you will need to install it with `pip install xgboost`.

Key aspects of XGBClassifier:

* Ensemble method that combines multiple decision trees to improve accuracy.
* Uses gradient boosting to incrementally build trees that focus on hard-to-classify examples.
* Handles sparse data well and provides good accuracy with default parameters.
* Fast prediction speed and high performance compared to other tree models.
* Regularization helps prevent overfitting the training data.

In [29]:
from xgboost import XGBClassifier

kfold = model_selection.KFold(n_splits=10)
model = XGBClassifier(n_estimators=100)
xgb = model_selection.cross_val_score(model, X, Y, cv=kfold).mean()
xgb

0.7421736158578265

### 8. Primeros resultados
Create a dataframe with the results and their algorithms, sorted from highest to lowest.

In [31]:
result = [Bagging, rf, ada, gbc, xgb]
models = ["Bagging DT", "Random Forest", "Ada Boost", "GradientBoosting", "XGBoost"]

resume = pd.Series(result, models).sort_values(ascending=False)
resume

Bagging DT          0.775974
GradientBoosting    0.764286
Random Forest       0.761671
Ada Boost           0.760458
XGBoost             0.742174
dtype: float64
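The cell above returns a `Series`. If a `DataFrame` is wanted, as the exercise asks, one possible sketch is the following. The scores are copied from the outputs shown earlier in this notebook, and the column names `model` and `cv_accuracy` are arbitrary choices.

```python
import pandas as pd

# Cross-validated accuracies copied from the outputs above
scores = {
    "Bagging DT": 0.775974,
    "Random Forest": 0.761671,
    "Ada Boost": 0.760458,
    "GradientBoosting": 0.764286,
    "XGBoost": 0.742174,
}

results_df = (
    pd.DataFrame({"model": list(scores), "cv_accuracy": list(scores.values())})
    .sort_values("cv_accuracy", ascending=False)  # highest score first
    .reset_index(drop=True)
)
print(results_df)
```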

### 9. Hiperparametrización
Train the models again, but this time split the dataset into train/test and use a grid search to find the best hyperparameters.

In [36]:
from sklearn.model_selection import train_test_split

# Split data into features (X) and labels (y)
y = df['class'] 
X = df.drop('class', axis=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [38]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define hyperparameter space
param_grid = {'C': [0.1, 1, 10, 100], 
              'gamma': [0.001, 0.01, 0.1, 1]}

# Create grid search object
grid = GridSearchCV(SVC(), param_grid, cv=5)

# Fit on training data
grid.fit(X_train, y_train) 

# View best hyperparameters 
print(grid.best_params_)

# GridSearchCV (refit=True by default) has already refit the best
# configuration on the full training set, so reuse that estimator
model = grid.best_estimator_


{'C': 1, 'gamma': 0.001}
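The search above tunes an SVC. The same pattern applies to the ensembles from the earlier sections. Below is a minimal sketch for a random forest, assuming synthetic data and a deliberately small, hypothetical parameter grid so that it runs quickly. It also scores the tuned model on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the diabetes data (assumption)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# Hypothetical grid; a real search would usually cover a wider range
param_grid_rf = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
    "max_features": [2, 3],
}

grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=3)
grid_rf.fit(X_tr, y_tr)  # cross-validates every combination on the training set

best_rf = grid_rf.best_estimator_  # already refit on the full training set
test_accuracy = best_rf.score(X_te, y_te)
print(grid_rf.best_params_, round(test_accuracy, 3))
```

Swapping in `AdaBoostClassifier` or `GradientBoostingClassifier` only requires changing the estimator and the keys of the grid to parameters that estimator accepts.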


### 10. Conclusiones finales

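* Ensemble models such as random forests and gradient boosting often outperform individual models such as logistic regression or SVM, because they combine many weaker models into a stronger one. This notebook does not compare against single models directly. Among the ensembles tested, bagged decision trees scored highest (0.776 cross-validated accuracy) and XGBoost with default parameters lowest (0.742).

* Tuning ensemble models with grid search can improve their performance by finding better values for hyperparameters such as `n_estimators` and `max_depth`. The size of the improvement depends on the dataset and should be confirmed on the held-out test set.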

* Using cross-validation allows more reliable evaluation of model performance by testing on multiple held-out subsets of the training data.

* Ensemble methods can overfit if the individual models are too complex or, for boosting, if too many rounds are trained. Tuning and cross-validation help detect and limit overfitting.

* Feature engineering and selection are important to improve the signal available to ensemble models. Techniques such as PCA can help derive more compact features, although whether they help should be checked with cross-validation.

* Tree ensembles such as random forests perform an implicit form of feature selection: each split uses the most predictive feature among the candidates considered, which makes them relatively robust to uninformative features.

* Overall, ensembles can provide a clear boost in model performance. With proper tuning and evaluation they can reach high accuracy while keeping overfitting under control. Stacking further models may provide additional gains.

* Ensembles are a powerful technique that should be part of any machine learning practitioner's toolkit. Proper usage as demonstrated in this notebook can enhance most modeling pipelines.
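The point about implicit feature selection can be inspected through the `feature_importances_` attribute of a fitted forest. The sketch below assumes synthetic data in which only the first three of eight columns carry signal (`shuffle=False` keeps them in front), so it illustrates the mechanism rather than anything about the diabetes features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: columns 0-2 are informative, columns 3-7 are noise (assumption)
X_syn, y_syn = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_syn, y_syn)
importances = forest.feature_importances_  # impurity-based, sums to 1
ranking = importances.argsort()[::-1]      # feature indices, most important first

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Impurity-based importances can be biased towards features with many distinct values, so they are best read as a rough guide.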



