## 05. Improving a (Machine Learning) Model

The first predictions a model makes during its lifecycle are rarely its most accurate ones. They serve as a reference point to improve on, and are known as:

* First Prediction: Baseline Prediction.
* First Model: Baseline Model.

***Hyperparameters vs. Parameters?***
* `hyper-parameter`: a setting on a model that you can adjust to (potentially) improve its ability to find patterns
* `parameter`: a pattern the model finds within the dataset provided to it during training
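
To make the distinction concrete, here is a minimal sketch. It assumes a small synthetic dataset from `make_classification` (not the heart disease data used later), and the value `n_estimators=10` is picked only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hyperparameter: a setting we choose ourselves before training
clf = RandomForestClassifier(n_estimators=10, random_state=42)
print(clf.get_params()["n_estimators"])

# Nothing has been learned yet, so the fitted attributes do not exist
fitted_before = hasattr(clf, "estimators_")
print(fitted_before)

# Parameters: what the model learns from the data during fit()
clf.fit(X, y)
print(len(clf.estimators_))            # the trees built from the data
print(clf.feature_importances_.shape)  # one learned importance per feature
```

Attributes with a trailing underscore (such as `estimators_`) are scikit-learn's convention for values learned during `fit()`.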

***3 Ways to Adjust HyperParameters***
1. By hand (manually)
2. Randomly with `RandomizedSearchCV`
3. Exhaustively with `GridSearchCV`
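
The three approaches can be sketched side by side as follows. This is only an illustration: the data is synthetic, and the search space in `param_grid` is a hypothetical one chosen to keep the run short, not a recommended set of values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Hypothetical search space: 2 * 2 * 2 = 8 combinations
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
    "min_samples_split": [2, 4],
}

# 1. By hand: pick the values yourself and fit
manual_clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
manual_clf.fit(X, y)

# 2. Randomly: try n_iter combinations sampled from the search space
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=4,
    cv=3,
    random_state=42,
)
random_search.fit(X, y)

# 3. Exhaustively: try every combination in the search space
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
)
grid_search.fit(X, y)

print(random_search.best_params_)
print(grid_search.best_params_)
```

Both search classes use cross-validation (`cv=3` here), so they do not need a separate validation set. The random search is cheaper because it only fits `n_iter` combinations.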



***How do We Make our Model Better?***

**Data Perspective:**
* Could we collect more data? (Generally, the more data, the better)
* Could we improve our data? (For example, add more depth by including more features)

**Model Perspective:**
* Is there a better model we could use? (Check the scikit-learn estimator map)
* Could we improve the current model?

In [37]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

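### 5.1 Tuning Hyperparameters by Hand
We'll split the data into 3 sets:
1. Training (the model learns from this set)
2. Validation (hyperparameters are tuned against this set)
3. Testing (the final model is evaluated on this set)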

In [20]:
# create a function to evaluate predictions
def evaluate_pred(y_true, y_preds):
    """
    Compares the y_true labels against the y_preds labels of a
    classification model and returns the metrics as a dictionary.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    
    metric_dict = {
        "accuracy": round(accuracy, 2),
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1 score": round(f1, 2)
    }
    
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision * 100:.2f}%")
    print(f"Recall: {recall * 100:.2f}%")
    print(f"F1 Score: {f1 * 100:.2f}%")
    
    return metric_dict
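
As a quick check of what these four metrics measure, here is a small worked example on hand-made labels. The `y_true` and `y_preds` lists below are made up for illustration and are not model output.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels: 3 true positives, 1 true negative,
# 1 false positive (index 4) and 1 false negative (index 2)
y_true = [1, 0, 1, 1, 0, 1]
y_preds = [1, 0, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_preds)    # (TP + TN) / total = 4 / 6
precision = precision_score(y_true, y_preds)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_preds)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_preds)                # harmonic mean of precision and recall

print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Precision: {precision * 100:.2f}%")
print(f"Recall: {recall * 100:.2f}%")
print(f"F1 Score: {f1 * 100:.2f}%")
```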

In [28]:
# import the heart disease dataset
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [65]:
# shuffle the samples and divide into features X and labels y
np.random.seed(42)
heart_disease_shuffled = heart_disease.sample(frac=1)

X = heart_disease_shuffled.drop('target', axis=1)
y = heart_disease_shuffled['target']

# split the data into training (70%), validation (15%) and test (15%) sets
train_split = round(0.70 * len(heart_disease_shuffled))
valid_split = round(train_split + 0.15 * len(heart_disease_shuffled))

X_train, y_train = X[:train_split], y[:train_split]
X_valid, y_valid = X[train_split:valid_split], y[train_split:valid_split]
X_test,  y_test  = X[valid_split:], y[valid_split:]
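
An alternative way to get a similar 70/15/15 split is to call scikit-learn's `train_test_split` twice. The sketch below assumes a synthetic stand-in DataFrame (`toy_df`, a hypothetical name) with the same number of rows as the heart disease data, so it runs without the CSV file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 303 rows, 3 features and a binary target
rng = np.random.default_rng(42)
toy_df = pd.DataFrame(rng.normal(size=(303, 3)), columns=["f1", "f2", "f3"])
toy_df["target"] = rng.integers(0, 2, size=303)

X = toy_df.drop("target", axis=1)
y = toy_df["target"]

# First split: 70% training, 30% held out (train_test_split shuffles by default)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Second split: divide the held-out 30% evenly into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42
)

# Every sample should land in exactly one of the three sets
print(len(X_train), len(X_valid), len(X_test))
print(len(X_train) + len(X_valid) + len(X_test) == len(toy_df))
```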

