### **📝 Instructions**
#### **Predicting diabetes**
In the previous project we saw how a decision tree can be used for classification and regression problems. However, did you know that a random forest can improve on the predictions of a single tree?

As we have studied, a random forest is an ensemble of trees, each built from a random portion of the data and with random criteria. This approach can improve the effectiveness of the model when an individual tree is not sufficient.
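
To make the idea concrete, here is a minimal hand-rolled sketch of a forest: each tree is trained on a bootstrap sample (the random portion of the data) and considers a random subset of features at each split (the random criteria). It uses synthetic data from `make_classification` as a stand-in, not the diabetes dataset, and a plain majority vote. scikit-learn's `RandomForestClassifier` averages class probabilities instead, so treat this as an illustration rather than a replica of the library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the diabetes dataset)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Random portion of the data: rows drawn with replacement (bootstrap)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Random criteria: each split only looks at a random subset of features
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    t.fit(X_tr[idx], y_tr[idx])
    trees.append(t)

# Majority vote across the 25 trees (odd count, so no ties for two classes)
votes = np.stack([t.predict(X_te) for t in trees])  # shape (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

forest_acc = (forest_pred == y_te).mean()
single_acc = (trees[0].predict(X_te) == y_te).mean()
print(f"Single tree accuracy: {single_acc:.3f}")
print(f"Hand-rolled forest accuracy: {forest_acc:.3f}")
```

Comparing the two printed values on your own data gives a quick feel for how much the ensemble helps over one tree.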

In this project you will apply this idea by training a random forest on the dataset to improve the `accuracy`.

In [6]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import (MinMaxScaler,
                                   StandardScaler,
                                   LabelEncoder,
                                   OneHotEncoder)
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import (chi2,
                                       SelectKBest,
                                       f_regression)
from sklearn.model_selection import (train_test_split,
                                     GridSearchCV)  # used for hyperparameter optimization
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.linear_model import (LogisticRegression,
                                  Lasso)
from sklearn.metrics import (accuracy_score,
                            confusion_matrix,
                            classification_report)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier

# Model persistence (saving the trained model)
from pickle import dump

### **Step 1: Loading the dataset**
Load the processed dataset from the previous project (split into training and test samples and analyzed with EDA).

In [2]:
df_train = pd.read_csv("../data/processed/df_train_clean.csv")
df_test = pd.read_csv("../data/processed/df_test_clean.csv")

In [3]:
display(df_train.head())
display(df_test.head())

Unnamed: 0,pregnancies,glucose,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
0,0.0,111.0,0.0,0.0,24.6,0.66,31.0,0
1,5.0,136.0,41.0,88.0,35.0,0.286,35.0,1
2,2.0,74.0,0.0,0.0,0.0,0.102,22.0,0
3,2.0,141.0,34.0,128.0,25.4,0.699,24.0,0
4,4.0,110.0,20.0,100.0,28.4,0.118,27.0,0


Unnamed: 0,pregnancies,glucose,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
0,5.0,155.0,44.0,545.0,38.7,0.619,34.0,0
1,3.0,158.0,13.0,387.0,31.2,0.295,24.0,0
2,1.0,189.0,23.0,846.0,30.1,0.398,59.0,1
3,2.0,146.0,38.0,360.0,28.0,0.337,29.0,1
4,0.0,78.0,29.0,40.0,36.9,0.434,21.0,0


### **Step 2: Build a random forest**
One way to improve on the results of a single decision tree is to build a random forest with enough trees to provide the variety needed to enrich the prediction. Train it and analyze its results. Then try different values for two of the hyperparameters that control the trees (for example, `n_estimators` and `max_depth`), analyze their impact on the final accuracy, and plot your conclusions.
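
The hyperparameter exploration described above could be sketched as a small sweep. The choice of `n_estimators` and `max_depth` as the two hyperparameters is an assumption, as are the candidate values, and the data is a synthetic stand-in so that the snippet runs on its own. In the notebook you would use `X_train`, `y_train`, `X_test`, and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the diabetes dataset)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate values are illustrative, not recommendations
n_estimators_values = [10, 50, 100]
max_depth_values = [2, 5, None]

results = {}
for n in n_estimators_values:
    for d in max_depth_values:
        model = RandomForestClassifier(n_estimators=n, max_depth=d,
                                       random_state=42)
        model.fit(X_tr, y_tr)
        # Store test accuracy for this (n_estimators, max_depth) pair
        results[(n, d)] = accuracy_score(y_te, model.predict(X_te))

for (n, d), acc in results.items():
    print(f"n_estimators={n:>3}, max_depth={d}: accuracy={acc:.3f}")
```

The `results` dictionary can then be plotted, for instance as one line per `max_depth` value with `n_estimators` on the x-axis, to support the conclusions.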

In [4]:
X_train = df_train.drop('outcome', axis=1).reset_index(drop=True)
y_train = df_train['outcome'].reset_index(drop=True)
X_test = df_test.drop('outcome', axis=1).reset_index(drop=True)
y_test = df_test['outcome'].reset_index(drop=True)
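
Note that the previews in Step 1 show an `Unnamed: 0` column. This is typically the old row index that was written to the CSV and read back as a regular column, and the split above keeps it as a feature. The sketch below shows one way to exclude it, using a small frame copied from the first three rows of the training preview (only a subset of the columns, for brevity). Whether to drop it is a judgment call, and the results reported in this notebook come from the code as shown, with the column included.

```python
import pandas as pd

# First three rows of the training preview from Step 1 (subset of columns)
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "pregnancies": [0.0, 5.0, 2.0],
    "glucose": [111.0, 136.0, 74.0],
    "bmi": [24.6, 35.0, 0.0],
    "outcome": [0, 1, 0],
})

# Drop both the target and the leftover index column from the features
X = df.drop(columns=["Unnamed: 0", "outcome"])
y = df["outcome"]

print(X.columns.tolist())  # ['pregnancies', 'glucose', 'bmi']
```

An alternative is to pass `index_col=0` to `pd.read_csv` so the column becomes the index at load time.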

In [20]:
clf_rf = RandomForestClassifier(n_estimators=50, bootstrap=True)
clf_rf.fit(X_train, y_train)

In [21]:
y_pred = clf_rf.predict(X_test)
y_pred

array([1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

In [22]:
# Evaluate model
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")

Accuracy: 0.7662337662337663
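
Accuracy alone does not show which kind of mistakes the model makes. The notebook imports `confusion_matrix` and `classification_report` but does not use them, so here is a short sketch of how they could complement the accuracy. The labels below are toy values for illustration only, not the predictions from this notebook.

```python
from sklearn.metrics import (accuracy_score,
                             classification_report,
                             confusion_matrix)

# Toy labels for illustration only (NOT the notebook's y_test / y_pred)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"Accuracy: {accuracy_score(y_true, y_hat):.2f}")  # 0.70
print(f"False negatives (missed positives): {fn}")       # 2
print(classification_report(y_true, y_hat))
```

In a diabetes setting, false negatives are usually the costlier error, so the recall of the positive class is worth checking alongside accuracy.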


### **Step 3: Save the model**
Store the model in the corresponding folder.

In [23]:
# Use a context manager so the file is flushed and closed after writing
with open("../models/ranfor_classifier_nestimators-50_bootstrap-true.sav", "wb") as f:
    dump(clf_rf, f)
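
To check that a saved model can be used later, it can be loaded back with `pickle.load` and compared against the original. The sketch below is self-contained: it trains a small forest on synthetic data and writes to a temporary directory with a hypothetical file name, rather than touching the `../models/` folder.

```python
import os
import tempfile
from pickle import dump, load

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: NOT the diabetes dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=50, bootstrap=True,
                               random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this demonstration
    path = os.path.join(tmp_dir, "ranfor_classifier_demo.sav")
    with open(path, "wb") as f:
        dump(model, f)
    with open(path, "rb") as f:
        restored = load(f)

# The restored model should give exactly the same predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(f"Restored model reproduces predictions: {same_predictions}")
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.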

### **Step 4: Optimize the model**

In [19]:
# Define the model
rf = RandomForestClassifier()

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

# Configure GridSearchCV (pass the fresh `rf` estimator defined above)
grid_search = GridSearchCV(rf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best hyperparameters
print(f"Best hyperparameters: {grid_search.best_params_}")

Best hyperparameters: {'bootstrap': True, 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 50}
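
The grid above has 4 × 3 × 3 × 3 × 2 × 2 = 432 combinations, so with `cv=5` the search fits 2160 forests. The best parameters are chosen by cross-validation on the training data only, so a natural next step is to score the refitted best model on the held-out test set. Below is a minimal sketch of that step with a much smaller, illustrative grid and synthetic stand-in data. In the notebook you would use `grid_search.best_estimator_` with `X_test` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: NOT the diabetes dataset)
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Deliberately small grid so the example runs quickly (illustrative values)
small_grid = {"n_estimators": [10, 50], "min_samples_leaf": [1, 2]}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid=small_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr
best_model = search.best_estimator_
test_acc = accuracy_score(y_te, best_model.predict(X_te))

print(f"Best hyperparameters: {search.best_params_}")
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

Comparing the cross-validated score with the test accuracy helps show whether the tuned model generalizes or whether the search has overfit the training folds.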
