### **📝 Instructions**
#### **Predicting diabetes**
In the two previous projects we used a decision tree and then a random forest to predict diabetes. Those results still leave room for improvement. Can boosting be the alternative that optimizes them?

Boosting is a sequential composition of models (usually decision trees) in which the new model aims to correct the errors of the previous one. This view may be useful in this data set, since several of the assumptions studied in the module are met.
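
The sequential, error-correcting idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the algorithm XGBoost runs: it assumes a squared-error loss, one-dimensional toy data, and depth-1 trees (stumps). XGBoost itself uses a regularized objective and, for classification, a logistic loss. The names `fit_stump`, `predict`, `boost`, and the toy values are hypothetical.

```python
# Toy boosting for squared error: each new stump is fit to the residuals
# (errors) left by the ensemble built so far.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:              # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict(stumps, learning_rate, base, x):
    """Base prediction plus the shrunken contribution of every stump."""
    total = base
    for t, lv, rv in stumps:
        total += learning_rate * (lv if x <= t else rv)
    return total

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    base = sum(ys) / len(ys)                    # start from the mean prediction
    stumps, history = [], []
    for _ in range(n_rounds):
        preds = [predict(stumps, learning_rate, base, x) for x in xs]
        residuals = [y - p for y, p in zip(ys, preds)]
        history.append(sum(r ** 2 for r in residuals) / len(ys))  # MSE before this round
        stumps.append(fit_stump(xs, residuals))  # new model targets the current errors
    return base, stumps, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]                   # hypothetical binary labels
base, stumps, history = boost(xs, ys)
print(f"training MSE: first round {history[0]:.3f}, last round {history[-1]:.3f}")
```

The training error recorded in `history` never increases from one round to the next, which is the "each model corrects the previous one" behaviour described above. The `learning_rate` plays the same role as the XGBoost hyperparameter of that name: it shrinks each stump's contribution.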

In this project you will focus on this idea by training a boosting model on the dataset to improve the `accuracy`.


Remember that previous projects can be found [here](https://github.com/rodri-iot/machine_learning_project_decision_tree_and_random_forest/blob/main/src/app_dt.ipynb) (decision trees) and [here](https://github.com/rodri-iot/machine_learning_project_decision_tree_and_random_forest/blob/main/src/app_random_forest.ipynb) (random forest).

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import (MinMaxScaler,
                                   StandardScaler,
                                   LabelEncoder,
                                   OneHotEncoder)
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import (chi2,
                                       SelectKBest,
                                       f_regression)
from sklearn.model_selection import (train_test_split,
                                     GridSearchCV)  # For hyperparameter optimization
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.linear_model import (LogisticRegression,
                                  Lasso)
from sklearn.metrics import (accuracy_score,
                             mean_squared_error,
                             confusion_matrix,
                             classification_report)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from xgboost import XGBClassifier

# Model persistence
from pickle import dump

#### **Step 1: Loading the dataset**
Load the processed dataset from the previous project (split into training and test samples and analyzed with EDA).

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/decision-tree-project-tutorial/main/diabetes.csv")
df.info()
# Save an untouched copy of the raw data to ../data/raw
df_raw = df.copy()
df_raw.to_csv("../data/raw/df_raw.csv", index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [3]:
# Preprocessing data
df_interim = (
    df_raw
        .copy()
        .set_axis(
            df_raw.columns.str.replace(' ','_')
                          .str.replace(r'\W', '', regex=True)
                          .str.lower()
                          .str.slice(0, 40), axis=1
        )
        .drop_duplicates().reset_index(drop=True)
)
df_interim.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   pregnancies               768 non-null    int64  
 1   glucose                   768 non-null    int64  
 2   bloodpressure             768 non-null    int64  
 3   skinthickness             768 non-null    int64  
 4   insulin                   768 non-null    int64  
 5   bmi                       768 non-null    float64
 6   diabetespedigreefunction  768 non-null    float64
 7   age                       768 non-null    int64  
 8   outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
df_interim.to_csv("../data/interim/df_interim.csv", index=False)
# Split the DataFrame into training and test sets
df = df_interim.copy()

df_train, df_test = train_test_split(df, test_size=0.2, random_state=2024)

In [5]:
display(df_train.describe().T)

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| pregnancies | 614.0 | 3.801303 | 3.341801 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| glucose | 614.0 | 121.40228 | 32.625455 | 0.0 | 99.0 | 118.0 | 141.75 | 199.0 |
| bloodpressure | 614.0 | 69.456026 | 20.278257 | 0.0 | 64.0 | 72.0 | 80.0 | 122.0 |
| skinthickness | 614.0 | 20.716612 | 15.769687 | 0.0 | 0.0 | 23.0 | 32.75 | 63.0 |
| insulin | 614.0 | 80.947883 | 115.687775 | 0.0 | 0.0 | 34.0 | 125.0 | 744.0 |
| bmi | 614.0 | 32.088111 | 8.109101 | 0.0 | 27.5 | 32.4 | 36.6 | 67.1 |
| diabetespedigreefunction | 614.0 | 0.47899 | 0.34491 | 0.085 | 0.238 | 0.365 | 0.6485 | 2.42 |
| age | 614.0 | 33.100977 | 11.722438 | 21.0 | 24.0 | 29.0 | 40.0 | 81.0 |
| outcome | 614.0 | 0.346906 | 0.476373 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [6]:
X_train = df_train.drop('outcome', axis=1).reset_index(drop=True)
y_train = df_train['outcome'].reset_index(drop=True)
X_test = df_test.drop('outcome', axis=1).reset_index(drop=True)
y_test = df_test['outcome'].reset_index(drop=True)

select_model = SelectKBest(k=7)
select_model.fit(X_train, y_train)

select_cols = X_train.columns[select_model.get_support()]
X_train_sel = pd.DataFrame(select_model.transform(X_train), columns=select_cols)
X_test_sel = pd.DataFrame(select_model.transform(X_test), columns=select_cols)

In [7]:
X_train_sel["outcome"] = y_train.values
X_test_sel["outcome"] = y_test.values
X_train_sel.to_csv("../data/processed/df_train_clean.csv", index = False)
X_test_sel.to_csv("../data/processed/df_test_clean.csv", index = False)

#### **Step 2: Build a boosting model**
One way to optimize and improve the results is to build a boosting model, so that the ensemble has the variety needed to enrich the prediction. Train it and analyze its results. Try different values for the hyperparameters that define the model, analyze their impact on the final accuracy, and plot the conclusions.

In [8]:
df_train = pd.read_csv("../data/processed/df_train_clean.csv")
df_test = pd.read_csv("../data/processed/df_test_clean.csv")

In [9]:
display(df_train.head())
display(df_test.head())

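| | pregnancies | glucose | skinthickness | insulin | bmi | diabetespedigreefunction | age | outcome |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 111.0 | 0.0 | 0.0 | 24.6 | 0.66 | 31.0 | 0 |
| 1 | 5.0 | 136.0 | 41.0 | 88.0 | 35.0 | 0.286 | 35.0 | 1 |
| 2 | 2.0 | 74.0 | 0.0 | 0.0 | 0.0 | 0.102 | 22.0 | 0 |
| 3 | 2.0 | 141.0 | 34.0 | 128.0 | 25.4 | 0.699 | 24.0 | 0 |
| 4 | 4.0 | 110.0 | 20.0 | 100.0 | 28.4 | 0.118 | 27.0 | 0 |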


| | pregnancies | glucose | skinthickness | insulin | bmi | diabetespedigreefunction | age | outcome |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 155.0 | 44.0 | 545.0 | 38.7 | 0.619 | 34.0 | 0 |
| 1 | 3.0 | 158.0 | 13.0 | 387.0 | 31.2 | 0.295 | 24.0 | 0 |
| 2 | 1.0 | 189.0 | 23.0 | 846.0 | 30.1 | 0.398 | 59.0 | 1 |
| 3 | 2.0 | 146.0 | 38.0 | 360.0 | 28.0 | 0.337 | 29.0 | 1 |
| 4 | 0.0 | 78.0 | 29.0 | 40.0 | 36.9 | 0.434 | 21.0 | 0 |


In [10]:
X_train = df_train.drop('outcome', axis=1).reset_index(drop=True)
y_train = df_train['outcome'].reset_index(drop=True)
X_test = df_test.drop('outcome', axis=1).reset_index(drop=True)
y_test = df_test['outcome'].reset_index(drop=True)

In [11]:
clf_bm = XGBClassifier(learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.7)
clf_bm.fit(X_train, y_train)

In [12]:
y_pred = clf_bm.predict(X_test)
y_pred

array([1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0])

In [13]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7467532467532467


In [14]:
# Hyperparameter search
hyperparams = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5],
    'subsample': [0.7, 0.8, 1.0]
}

grid_search = GridSearchCV(clf_bm,
                           param_grid=hyperparams,
                           scoring='neg_mean_squared_error',  # Evaluation metric; with 0/1 labels it ranks candidates the same way as accuracy
                           cv=5,  # Number of cross-validation folds
                           verbose=1,  # Show the progress of the search
                           n_jobs=-1)  # Use all CPU cores


In [15]:
grid_search.fit(X_train, y_train)

print("Best parameters: ", grid_search.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


Best parameters:  {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.7}
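
The candidate and fit counts reported by the search follow directly from the size of the grid. A minimal sketch using only the standard library (the variable names `cv_folds`, `candidates`, and `n_fits` are illustrative):

```python
from itertools import product

# Same grid as in the search cell above
hyperparams = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5],
    'subsample': [0.7, 0.8, 1.0],
}
cv_folds = 5

# Every combination of values is one candidate model: 2 * 3 * 2 * 3
candidates = [dict(zip(hyperparams, values))
              for values in product(*hyperparams.values())]

# Each candidate is trained once per cross-validation fold
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)  # 36 180
```

Adding one more value to any list multiplies the number of candidates, so the cost of an exhaustive grid grows quickly. The final refit of the best candidate on the whole training set is not included in the reported count.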


In [16]:
# Evaluate the model on the test set with the best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on the test set: {mse}")

Mean Squared Error on the test set: 0.2532467532467532
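
For 0/1 class labels, the mean squared error of the predicted classes is simply the error rate, so it equals 1 minus the accuracy. The sketch below checks this on a small set of hypothetical labels and on the two values reported in this notebook; `y_true` and `y_hat` are made up for illustration.

```python
# Hypothetical 0/1 labels and predictions, for illustration only
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_hat = [1, 0, 1, 1, 0, 0, 1, 0]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / n
# (t - p) ** 2 is 1 for a wrong prediction and 0 for a correct one
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n

print(accuracy, mse)  # 0.75 0.25

# Values printed earlier in this notebook
reported_accuracy = 0.7467532467532467
reported_mse = 0.2532467532467532
print(abs((1 - reported_accuracy) - reported_mse) < 1e-12)  # True
```

This is why the tuned model's test MSE matches the baseline accuracy: the search selected the same hyperparameters as the baseline, and the two metrics carry the same information here.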


#### **Step 3: Save the model**
Store the model in the corresponding folder.

In [17]:
with open("../models/boosting_classifier_nestimators-100_learnrate-0.01_max_depth-3_subsample-0.7.sav", "wb") as f:
    dump(clf_bm, f)
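
A saved model is only useful if it can be loaded back. The following is a minimal round-trip sketch with `pickle`, assuming a plain dictionary as a stand-in for the fitted classifier and a temporary directory instead of the project's `models` folder; `model_stand_in` and the file name `model.sav` are hypothetical.

```python
import os
import tempfile
from pickle import dump, load

# Stand-in for the fitted classifier; any picklable object behaves the same way
model_stand_in = {"learning_rate": 0.01, "max_depth": 3,
                  "n_estimators": 100, "subsample": 0.7}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.sav")  # hypothetical file name
    # The context managers close the file handles even if an error occurs
    with open(path, "wb") as f:
        dump(model_stand_in, f)
    with open(path, "rb") as f:
        restored = load(f)

print(restored == model_stand_in)  # True
```

With the real classifier, the object returned by `load` can be used for `predict` directly. Pickle files should only be loaded from trusted sources, and ideally with the same library versions that wrote them.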

#### **Step 4: Analyze and compare model results**
Now compare the three models: analyze their predictions, and identify the class each one predicts most accurately and the class it predicts least accurately. Which of the three models do you choose?

1. Decision Tree Accuracy_score = 0.7597:

    - This is the accuracy value of the decision tree model.
    - Accuracy is the proportion of correct predictions out of all the predictions the model made.
    - In this case, the decision tree model was correct about 75.97% of the time.

2. Random Forest Accuracy_score = 0.7662:

    - This value is the accuracy of the Random Forest model, which is a combination of multiple decision trees.
    - In this case, the Random Forest model was correct on about 76.62% of the predictions.
    - This model shows a slight improvement over the single decision tree, which is common since Random Forest averages the results of multiple trees, reducing the risk of overfitting and improving generalization.

3. Boosting Model Accuracy_score = 0.7467:

    - This is the accuracy value of the Boosting model, which also combines multiple trees but builds them sequentially (each tree corrects the errors of the previous one).
    - In this case, the model was correct 74.67% of the time.
    - Although Boosting often improves performance, here it did not outperform Random Forest or the Decision Tree. This may be due to the nature of the data or to the hyperparameter tuning: the grid search selected the same hyperparameters as the initial model, so it brought no gain.
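
Overall accuracy does not show which class each model predicts best. A minimal sketch of per-class accuracy (the fraction of each true class that is predicted correctly, also called recall) is given below. The function name `per_class_accuracy` and the labels are hypothetical and are not the notebook's actual predictions.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Fraction of the samples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels: 6 negatives (class 0) and 4 positives (class 1)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

scores = per_class_accuracy(y_true, y_pred)
print(scores)  # class 0: 5 of 6 correct, class 1: 2 of 4 correct
```

On the notebook's data, the already imported `classification_report(y_test, y_pred)` reports the same quantity in its `recall` column, alongside precision and F1 for each class.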

### Conclusion
All three models were trained on the same data. Random Forest turned out to be the most accurate with 76.62% accuracy, ahead of the Decision Tree (75.97%) and the Boosting model (74.67%).
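
To put these differences in perspective, the accuracies can be converted into counts of correctly classified test samples. The sketch below assumes that all three models were evaluated on a test set of 154 rows (768 total rows minus the 614 training rows in this notebook). That is confirmed here only for the boosting model and is an assumption for the two earlier projects.

```python
n_test = 154  # 768 rows - 614 training rows; assumed identical for all three models

accuracies = {
    "decision_tree": 0.7597,
    "random_forest": 0.7662,
    "boosting": 0.7467,
}

# Number of correctly classified test samples implied by each accuracy
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
gap = max(correct.values()) - min(correct.values())

print(correct)  # {'decision_tree': 117, 'random_forest': 118, 'boosting': 115}
print(gap)      # 3
```

Under that assumption, the best and worst models differ by only three test samples, so the ranking could change with a different train/test split or random seed.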