## 05. Improving a (Machine Learning) Model

Over the lifecycle of a model, the very first predictions you make are often not very accurate. They serve as a reference point for later improvements and are known as:

* First Prediction: Baseline Prediction.
* First Model: Baseline Model.

***Hyperparameters vs. Parameters***
* `hyperparameter`: a setting on a model that you can adjust to (potentially) improve its ability to find patterns
* `parameter`: a pattern the model finds in (learns from) the data provided to it
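The distinction can be made concrete with a small sketch. It assumes a synthetic dataset from `make_classification` (a stand-in, not the heart disease data) and shows that hyperparameters are available before fitting, while learned values only exist after `fit()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, used only for illustration)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)

# Hyperparameters: settings we choose before training
clf_demo = RandomForestClassifier(n_estimators=20, min_samples_leaf=3, random_state=42)
hyperparameters = clf_demo.get_params()
print(hyperparameters["n_estimators"], hyperparameters["min_samples_leaf"])

# Learned values: only exist once the model has seen the data
clf_demo.fit(X_demo, y_demo)
print(len(clf_demo.estimators_))             # the fitted trees
print(clf_demo.feature_importances_.shape)   # one importance per feature
```

In scikit-learn, attributes that end with an underscore (such as `feature_importances_`) are set during fitting, which is a handy way to tell the two kinds of values apart.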

***3 Ways to Adjust HyperParameters***
1. By hand (manually)
2. Randomly with `RandomizedSearchCV`
3. Exhaustively with `GridSearchCV`
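To get a feel for the difference between the random and exhaustive approaches, the sketch below counts candidate combinations with `ParameterGrid` and `ParameterSampler` from `sklearn.model_selection`. The search space is illustrative, and no model is trained here.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Illustrative search space (the values are assumptions)
search_space = {
    "n_estimators": [10, 100, 200, 500, 1000, 1200],
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4],
}

# Exhaustive search tries every combination: 6 * 5 * 2 * 3 * 3
all_candidates = list(ParameterGrid(search_space))
print(len(all_candidates))  # 540

# Random search only draws n_iter combinations from the same space
sampled_candidates = list(ParameterSampler(search_space, n_iter=10, random_state=42))
print(len(sampled_candidates))  # 10
```

With 5-fold cross-validation, the exhaustive version would mean 2700 model fits, which is why a random search is often used first to narrow the space.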



***How do We Make our Model Better?***

**Data Perspective:**
* Could we collect more data? (Generally, the more, the better)
* Could we improve our data? (add more depth, i.e. more features per sample)

**Model Perspective:**
* Is there a better model we could use? (Check the scikit-learn estimator map)
* Could we improve the current model?
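One way to probe the "could we collect more data?" question is a learning curve: train on growing subsets and watch the validation score. The sketch below uses `learning_curve` from scikit-learn on synthetic data (an assumption, not the notebook's dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (an assumption, used only for illustration)
X_lc, y_lc = make_classification(n_samples=300, n_features=10, random_state=42)

# Train on 25%, 50% and 100% of the available training folds
train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_lc,
    y_lc,
    train_sizes=[0.25, 0.5, 1.0],
    cv=5,
)

# Average the validation accuracy over the 5 folds for each training size
for size, score in zip(train_sizes, valid_scores.mean(axis=1)):
    print(f"{size} training samples -> mean validation accuracy {score:.2f}")
```

If the validation score is still climbing at the largest training size, more data is likely to help. If it has flattened, improving the features or the model is probably the better investment.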

In [7]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier

### 5.1 Tuning Hyperparameters by Hand
We'll make 3 datasets:
1. Training
2. Validation
3. Testing
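As an alternative to slicing by index, the three sets can also be produced by calling `train_test_split` twice. This is a sketch on synthetic data (an assumption), using the same 70/15/15 proportions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, used only for illustration)
X_all, y_all = make_classification(n_samples=200, n_features=5, random_state=42)

# First hold back 30% of the samples, keeping 70% for training
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

# Then split the held-back 30% in half: 15% validation, 15% test
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

print(len(X_tr), len(X_val), len(X_te))  # 140 30 30
```

The validation set is used to compare hyperparameter settings, and the test set is kept aside for a final, one-off evaluation.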

In [7]:
# create a function to evaluate predictions
def evaluate_pred(y_true, y_preds):
    """
    Performs evaluation comparison on y_true labels vs y_pred labels
    on a classification model.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    
    metric_dict = {
        "accuracy": round(accuracy, 2),
        "precision": round(precision, 2),
        "recall": round(recall, 2),
        "f1 score": round(f1, 2)
    }
    
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision * 100:.2f}%")
    print(f"Recall: {recall * 100:.2f}%")
    print(f"F1 Score: {f1 * 100:.2f}%")
    
    return metric_dict

In [8]:
# import the heart disease dataset
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [9]:
# shuffle the samples and divide into features X and labels y
np.random.seed(42)
heart_disease_shuffled = heart_disease.sample(frac=1)

X = heart_disease_shuffled.drop('target', axis=1)
y = heart_disease_shuffled['target']

# split the data into train, validation and test sets
train_split = round(0.70 * len(heart_disease_shuffled))
valid_split = round(train_split + 0.15 * len(heart_disease_shuffled))

X_train, y_train = X[:train_split], y[:train_split]
X_valid, y_valid = X[train_split:valid_split], y[train_split:valid_split]
X_test,  y_test  = X[valid_split:], y[valid_split:]

# fit the model on the training data
clf = RandomForestClassifier()

clf.fit(X_train, y_train)
y_preds = clf.predict(X_valid)

baseline_metrics = evaluate_pred(y_valid, y_preds)

Accuracy: 82.22%
Precision: 81.48%
Recall: 88.00%
F1 Score: 84.62%


In [10]:
# create a second classifier with adjusted hyperparameters (n_estimators, min_samples_leaf)
np.random.seed(42)
clf_2 = RandomForestClassifier(n_estimators=200, min_samples_leaf=3)
clf_2.fit(X_train, y_train)

# prediction
clf_y_preds = clf_2.predict(X_valid)
baseline_metrics_2 = evaluate_pred(y_valid, clf_y_preds)

Accuracy: 84.44%
Precision: 82.14%
Recall: 92.00%
F1 Score: 86.79%
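Tuning by hand usually means repeating the cell above with different settings. A minimal sketch of that loop, assuming synthetic data and an arbitrary list of `n_estimators` values, could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, used only for illustration)
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=42)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Try a few values by hand and record the validation accuracy of each
valid_scores = {}
for n_estimators in [10, 50, 100]:
    candidate = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    candidate.fit(X_fit, y_fit)
    valid_scores[n_estimators] = accuracy_score(y_val, candidate.predict(X_val))

# Pick the setting with the highest validation accuracy
best_n = max(valid_scores, key=valid_scores.get)
print(valid_scores)
print("best n_estimators:", best_n)
```

This works for one or two hyperparameters but quickly becomes tedious, which is the motivation for the automated searches in the next sections.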


### 5.2 Tuning with RandomizedSearchCV

In [11]:
from sklearn.model_selection import RandomizedSearchCV, train_test_split

grid = {
    "n_estimators": [10, 100, 200, 500, 1000, 1200],
    "max_depth": [None, 5, 10, 20, 30],
    "max_features": ["auto", "sqrt"],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4]
}

np.random.seed(42)
# split into X and y
X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]

# split into train and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#instantiate the classifier
clf = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(
    clf,
    param_distributions=grid,
    n_iter=10,  # number of hyperparameter combinations to try
    cv=5,
    verbose=2
)

# Fit the RandomizedSearchCV version of Classifier
rs_clf.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.1s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.1s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.2s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.2s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.1s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.1s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   0.2s
[CV] END max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimato

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=1),
                   param_distributions={'max_depth': [None, 5, 10, 20, 30],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 4, 6],
                                        'n_estimators': [10, 100, 200, 500,
                                                         1000, 1200]},
                   verbose=2)

In [15]:
rs_clf.best_params_

{'n_estimators': 1200,
 'min_samples_split': 6,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 30}

In [16]:
rs_y_preds = rs_clf.predict(X_valid)
rs_metrics = evaluate_pred(y_valid, rs_y_preds)

Accuracy: 84.44%
Precision: 82.14%
Recall: 92.00%
F1 Score: 86.79%
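Besides `best_params_`, a fitted search object keeps the cross-validation score of every candidate in `cv_results_`. The sketch below, on synthetic data with a deliberately small search space (both assumptions), shows one way to inspect them.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption, used only for illustration)
X_rs, y_rs = make_classification(n_samples=200, n_features=10, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [10, 50, 100],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,          # only 5 of the 27 possible combinations are tried
    cv=3,
    random_state=42,   # makes the sampled combinations reproducible
)
search.fit(X_rs, y_rs)

# One row per sampled combination, ranked by mean cross-validated score
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
print("best CV score:", search.best_score_)
```

Looking at the full table helps to judge whether the best candidate is clearly ahead or whether several settings perform about the same.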


### 5.3 Tuning with GridSearchCV

In [17]:
from sklearn.model_selection import GridSearchCV
np.random.seed(42)

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None],
    "max_features": ["auto", "sqrt"],
    "min_samples_split": [6],
    "min_samples_leaf": [1, 2]
}

# instantiate classifier
clf = RandomForestClassifier(n_jobs=1)
gs_clf = GridSearchCV(clf, param_grid, cv=5, verbose=2)

# model fitting
gs_clf.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.2s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.2s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.2s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.2s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.1s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=200; total time=   0.3s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=200; total time=   0.4s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, 

GridSearchCV(cv=5, estimator=RandomForestClassifier(n_jobs=1),
             param_grid={'max_depth': [None], 'max_features': ['auto', 'sqrt'],
                         'min_samples_leaf': [1, 2], 'min_samples_split': [6],
                         'n_estimators': [100, 200, 500]},
             verbose=2)

In [18]:
gs_clf.best_params_

{'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 6,
 'n_estimators': 100}

In [19]:
gs_y_preds = gs_clf.predict(X_valid)
gs_metrics = evaluate_pred(y_valid, gs_y_preds)

Accuracy: 80.00%
Precision: 80.77%
Recall: 84.00%
F1 Score: 82.35%
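Calling `predict` on a fitted search object works because, with the default `refit=True`, scikit-learn retrains the best combination on the full training data and stores it as `best_estimator_`. A small sketch on synthetic data (an assumption) illustrates this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, used only for illustration)
X_gs, y_gs = make_classification(n_samples=200, n_features=10, random_state=42)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_gs, y_gs, test_size=0.2, random_state=42
)

# 2 * 2 = 4 candidates, each evaluated with 3-fold cross-validation
gs_demo = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 50], "min_samples_leaf": [1, 2]},
    cv=3,
)
gs_demo.fit(X_fit, y_fit)

# The search object delegates predict() to the refitted best model
best_model = gs_demo.best_estimator_
same_predictions = np.array_equal(
    gs_demo.predict(X_hold), best_model.predict(X_hold)
)
print(gs_demo.best_params_, same_predictions)
```

This also means `best_estimator_` can be saved and used on its own, without the surrounding search object.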


## Compare Different Model Metrics

In [20]:
model_comparison = pd.DataFrame(data={
    "baseline": baseline_metrics,
    "baseline_2": baseline_metrics_2,
    "RandomizedSearch": rs_metrics,
    "GridSearch": gs_metrics
})

In [21]:
import matplotlib.pyplot as plt
%matplotlib inline

In [27]:
model_comparison

| | baseline | baseline_2 | RandomizedSearch | GridSearch |
|---|---|---|---|---|
| accuracy | 0.82 | 0.84 | 0.84 | 0.80 |
| precision | 0.81 | 0.82 | 0.82 | 0.81 |
| recall | 0.88 | 0.92 | 0.92 | 0.84 |
| f1 score | 0.85 | 0.87 | 0.87 | 0.82 |
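The comparison is easier to read as a grouped bar chart. The sketch below rebuilds the frame from the rounded values in the table above so that it runs on its own; the figure size, axis label and title are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rounded validation metrics copied from the comparison table
model_comparison = pd.DataFrame(
    {
        "baseline": [0.82, 0.81, 0.88, 0.85],
        "baseline_2": [0.84, 0.82, 0.92, 0.87],
        "RandomizedSearch": [0.84, 0.82, 0.92, 0.87],
        "GridSearch": [0.80, 0.81, 0.84, 0.82],
    },
    index=["accuracy", "precision", "recall", "f1 score"],
)

# One group of bars per metric, one bar per tuning approach
ax = model_comparison.plot.bar(figsize=(10, 6), rot=0)
ax.set_ylabel("score")
ax.set_title("Validation metrics per tuning approach")
```

In a notebook with `%matplotlib inline` the figure is displayed automatically. In a plain script, `plt.show()` would be needed.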


## 6. Saving and Loading Our Trained Machine Learning Model

Two ways to save and load machine learning models:
1. Python `pickle` module
2. `joblib` module

In [30]:
import pickle

# Save an existing model to file
# (saved under models/ so the path matches the load step below)
with open("models/gs_random_forest_model_1.pkl", "wb") as f:
    pickle.dump(gs_clf, f)

In [32]:
# load a saved model
with open("models/gs_random_forest_model_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)

In [35]:
# make predictions with saved model
pickle_y_preds = loaded_pickle_model.predict(X_valid)

In [37]:
evaluate_pred(y_valid, pickle_y_preds);

Accuracy: 80.00%
Precision: 80.77%
Recall: 84.00%
F1 Score: 82.35%


**Joblib**

In [42]:
from joblib import dump, load

# save model to file
dump(gs_clf, filename="models/gs_random_forest_model_1.1.joblib")

['models/gs_random_forest_model_1.1.joblib']

In [43]:
# import a saved joblib model
loaded_joblib_model = load(filename="models/gs_random_forest_model_1.1.joblib")

In [44]:
joblib_y_preds = loaded_joblib_model.predict(X_valid)

In [45]:
evaluate_pred(y_valid, joblib_y_preds);

Accuracy: 80.00%
Precision: 80.77%
Recall: 84.00%
F1 Score: 82.35%
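Both approaches should give back a model that predicts exactly like the original. The following sketch checks this round trip on a small synthetic model. The temporary directory and file names are assumptions made so the example leaves no files behind.

```python
import os
import pickle
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model (an assumption, used only for illustration)
X_io, y_io = make_classification(n_samples=100, n_features=5, random_state=42)
original = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_io, y_io)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "model.pkl")
    joblib_path = os.path.join(tmp_dir, "model.joblib")

    # pickle: open the file ourselves, in binary mode
    with open(pickle_path, "wb") as f:
        pickle.dump(original, f)
    with open(pickle_path, "rb") as f:
        pickle_model = pickle.load(f)

    # joblib: pass the path directly
    dump(original, joblib_path)
    joblib_model = load(joblib_path)

# Both restored models should reproduce the original predictions
pickle_match = np.array_equal(original.predict(X_io), pickle_model.predict(X_io))
joblib_match = np.array_equal(original.predict(X_io), joblib_model.predict(X_io))
print(pickle_match, joblib_match)  # True True
```

Saved models should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them, since both formats execute code on load and are not guaranteed to be compatible across versions.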


## 7. Putting It All Together

In [3]:
car_sales = pd.read_csv('data/car-sales-extended-missing-data.csv')
car_sales

| | Make | Colour | Odometer (KM) | Doors | Price |
|---|---|---|---|---|---|
| 0 | Honda | White | 35431.0 | 4.0 | 15323.0 |
| 1 | BMW | Blue | 192714.0 | 5.0 | 19943.0 |
| 2 | Honda | White | 84714.0 | 4.0 | 28343.0 |
| 3 | Toyota | White | 154365.0 | 4.0 | 13434.0 |
| 4 | Nissan | Blue | 181577.0 | 3.0 | 14043.0 |
| ... | ... | ... | ... | ... | ... |
| 995 | Toyota | Black | 35820.0 | 4.0 | 32042.0 |
| 996 | NaN | White | 155144.0 | 3.0 | 5716.0 |
| 997 | Nissan | Blue | 66604.0 | 4.0 | 31570.0 |
| 998 | Honda | White | 215883.0 | 4.0 | 4001.0 |


In [4]:
car_sales.dtypes

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object

In [5]:
car_sales.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

**Steps we need to take:**

1. Fill missing data
2. Convert data to numbers
3. Build a model on the data
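Before wiring these steps into a pipeline, it can help to see the first two done by hand with pandas. The tiny frame below is made up for illustration, and the fill values mirror common choices (a `"missing"` category, a constant, the column mean).

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one missing value per column
toy = pd.DataFrame({
    "Make": ["Honda", np.nan, "BMW"],
    "Doors": [4.0, np.nan, 5.0],
    "Odometer (KM)": [35431.0, 192714.0, np.nan],
})

# Step 1: fill missing data
filled = toy.copy()
filled["Make"] = filled["Make"].fillna("missing")
filled["Doors"] = filled["Doors"].fillna(4)
filled["Odometer (KM)"] = filled["Odometer (KM)"].fillna(
    filled["Odometer (KM)"].mean()
)

# Step 2: convert the categorical column to numbers (one-hot encoding)
encoded = pd.get_dummies(filled, columns=["Make"])
print(encoded)
```

Doing this by hand on the whole dataset lets statistics such as the mean leak from the test rows into training. A scikit-learn `Pipeline` avoids that by learning the fill values from the training split only.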


In [8]:
# Getting data ready
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Modeling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# setup random seed
np.random.seed(42)

# import data
data = pd.read_csv('data/car-sales-extended-missing-data.csv')
# drop rows without a label (the column is named "Price")
data.dropna(subset=["Price"], inplace=True)

# Define dataset features & pipeline
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

door_features = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))
])

numerical_features = ["Odometer (KM)"]
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])


# Preprocessing: Fill missing values & convert to numbers
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("door", door_transformer, door_features),
        ("num", numerical_transformer, numerical_features)
    ])

# create a preprocessing and modeling pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor())
])

# split data
# use `data` (rows with a missing Price already dropped), not the raw `car_sales`
X = data.drop("Price", axis=1)
y = data["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model.fit(X_train, y_train)
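Once the pipeline is fitted, it can be scored and tuned like any other estimator. The sketch below runs a `GridSearchCV` over a pipeline of the same shape, using the `step__substep__parameter` naming to reach nested hyperparameters. The data is a small synthetic stand-in with made-up values, and the grid is an arbitrary example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the car sales data (all values are made up)
rng = np.random.default_rng(42)
n = 120
odometer = rng.uniform(10_000, 250_000, size=n)
toy = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Honda", "BMW", "Toyota"], size=n), dtype=object),
    "Odometer (KM)": odometer,
    "Price": 35_000 - 0.1 * odometer + rng.normal(0, 1_000, size=n),
})
toy.loc[toy.index[::10], "Make"] = np.nan            # some missing categories
toy.loc[toy.index[5::10], "Odometer (KM)"] = np.nan  # some missing numbers

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
])
preprocessor = ColumnTransformer(transformers=[
    ("cat", categorical_transformer, ["Make"]),
    ("num", numerical_transformer, ["Odometer (KM)"]),
])
pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(random_state=42)),
])

X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Double underscores walk down the nesting: pipeline step -> transformer -> parameter
pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__n_estimators": [10, 50],
}
gs_pipe = GridSearchCV(pipe, pipe_grid, cv=3)
gs_pipe.fit(X_fit, y_fit)

print(gs_pipe.best_params_)
print("R^2 on held-out data:", gs_pipe.score(X_hold, y_hold))
```

Because the imputers and encoder live inside the pipeline, they are refitted on each cross-validation training fold, so the preprocessing choices are tuned without leaking information from the validation folds.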