Bias-variance trade-off:

1. Models with high bias (underfitting) usually have low variance.
2. Models with low bias usually have high variance (overfitting).

Bias-variance decomposition:

   *Generalization error = Bias^2 + Variance + Irreducible Error*
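The decomposition can be checked numerically. The sketch below is only an illustration under stated assumptions: a made-up target function (x squared), Gaussian noise, and a hand-written k-NN regressor evaluated at a single point. None of these names or settings come from the heart dataset used later; they are hypothetical stand-ins.

```python
import random
import statistics


def true_f(x):
    # Hypothetical noise-free target function
    return x ** 2


def knn_predict(train, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def bias_variance(k, x0=0.9, n_train=30, n_repeats=500, noise_sd=0.3, seed=0):
    """Estimate bias^2 and variance of the k-NN prediction at x0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        # Draw a fresh noisy training set each time
        xs = [rng.random() for _ in range(n_train)]
        train = [(x, true_f(x) + rng.gauss(0, noise_sd)) for x in xs]
        preds.append(knn_predict(train, x0, k))
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


for k in (1, 30):
    b, v = bias_variance(k)
    print(f"k={k:2d}  bias^2={b:.4f}  variance={v:.4f}")
```

With k=1 the prediction follows the noise of a single neighbor (low bias, high variance); with k=30 it averages the whole training set (high bias, low variance). The irreducible error is the noise variance, here noise_sd squared.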
    

Minimizing the generalization error is one of the main goals of machine learning, and there are several ways to do this:

1. By tuning the complexity of the model using grid search
2. By reducing the bias or the variance directly

**Grid Search using Parameter Grid**

*Start by fitting a k-NN classifier*

In [1]:
import pandas as pd

# Load data
data_df = pd.read_csv("c4_heart-numerical.csv")

# Create X/y arrays
X = data_df.drop("disease", axis=1).values
y = data_df.disease.values

data_df.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,ca,disease
0,63,145,233,150,2.3,0,absence
1,67,160,286,108,1.5,3,presence
2,67,120,229,129,2.6,2,presence
3,37,130,250,187,3.5,0,absence
4,41,130,204,172,1.4,0,absence


In [2]:
from sklearn.model_selection import train_test_split

# Split data into train/test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

Using the default values, i.e. without setting any hyperparameters:

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Create a k-NN classifier with default values
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Fit to train data
pipe.fit(X_tr, y_tr)

# Evaluate on test set
accuracy = pipe.score(X_te, y_te)
print("Accuracy: {:.3f}".format(accuracy))

Accuracy: 0.747


*Grid Search for multiple parameters*

To improve the k-NN classifier, we need to tune its hyperparameters:

1. n_neighbors - the number of neighbors.
2. p - the power parameter of the Minkowski distance. p=1 gives the L1 (Manhattan) distance and p=2 gives the L2 (Euclidean) distance.
3. weights - The weighting function for the majority vote.
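To make the role of p concrete, here is a minimal plain-Python sketch of the Minkowski distance. The function name minkowski_distance and the two sample points are illustrative only and are not part of scikit-learn's API.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance between two points given as sequences of numbers."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(a, b, p=1))  # 7.0 (L1: |3| + |4|)
print(minkowski_distance(a, b, p=2))  # 5.0 (L2: sqrt(9 + 16))
```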

When doing the majority vote, the classifier can use a weighting function. By default, all neighbors have the same weight. This corresponds to the 'uniform' strategy. However, we can also give more weight to closer data points. For instance, the 'distance' strategy assigns each neighbor a weight inversely proportional to its distance from the query point.
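The two strategies can give different answers for the same neighbors. The following is a simplified sketch, not scikit-learn's implementation: the helper weighted_vote and the three neighbors are hypothetical, and it assumes every distance is strictly positive.

```python
from collections import defaultdict


def weighted_vote(neighbors, weights="uniform"):
    """Pick a label from a list of (distance, label) pairs for the k nearest points."""
    votes = defaultdict(float)
    for dist, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0  # every neighbor counts the same
        elif weights == "distance":
            votes[label] += 1.0 / dist  # closer neighbors count more (dist > 0 assumed)
        else:
            raise ValueError(f"unknown weights: {weights!r}")
    return max(votes, key=votes.get)


neighbors = [(0.5, "presence"), (2.0, "absence"), (2.5, "absence")]
print(weighted_vote(neighbors, "uniform"))   # absence  (2 votes vs 1)
print(weighted_vote(neighbors, "distance"))  # presence (2.0 vs 0.5 + 0.4)
```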

In [5]:
import numpy as np

# Define a set of reasonable values
k_values = np.arange(1, 21)  # 1, 2, 3, .., 20
weights_functions = ["uniform", "distance"]
distance_types = [1, 2]  # L1, L2 distances

In [6]:
# Create a k-NN classifier
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Save accuracy on test set
test_scores = []

# Grid search
for k in k_values:
    for f in weights_functions:
        for d in distance_types:
            # Set hyperparameters
            pipe.set_params(knn__n_neighbors=k, knn__weights=f, knn__p=d)

            # Fit a k-NN classifier
            pipe.fit(X_tr, y_tr)

            # Evaluate on test set
            accuracy = pipe.score(X_te, y_te)

            # Save accuracy
            test_scores.append(
                {
                    "knn__n_neighbors": k,
                    "knn__weights": f,
                    "knn__p": d,
                    "accuracy": accuracy,
                }
            )

In [7]:
# Create DataFrame with test scores
scores_df = pd.DataFrame(test_scores)

# Top five scores
scores_df.sort_values(by="accuracy", ascending=False).head()

Unnamed: 0,knn__n_neighbors,knn__weights,knn__p,accuracy
14,4,distance,1,0.813187
28,8,uniform,1,0.813187
12,4,uniform,1,0.802198
32,9,uniform,1,0.802198
30,8,distance,1,0.802198


The top results show that more than 80% accuracy is possible with the L1 distance metric (p=1). However, nested loops quickly become unwieldy when there are more hyperparameters to tune or several different grids to evaluate.
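One way to avoid the nesting with only the standard library is itertools.product, which enumerates all combinations in a single loop. This is a sketch of the idea; the variable name combinations is illustrative, and range(1, 21) stands in for the NumPy array of k values defined earlier.

```python
from itertools import product

k_values = range(1, 21)
weights_functions = ["uniform", "distance"]
distance_types = [1, 2]

# One dictionary per combination, in the same order as the three nested loops
combinations = [
    {"knn__n_neighbors": k, "knn__weights": f, "knn__p": d}
    for k, f, d in product(k_values, weights_functions, distance_types)
]

print(len(combinations))  # 80
print(combinations[0])
```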

**Grid Search using ParameterGrid**

ParameterGrid takes a dictionary that maps each parameter name to the list of values to try.

In [8]:
from sklearn.model_selection import ParameterGrid

# Define a grid of values
grid = ParameterGrid(
    {
        "knn__n_neighbors": k_values,
        "knn__weights": weights_functions,
        "knn__p": distance_types,
    }
)

# Print the number of combinations
print("Number of combinations:", len(grid))
# Prints: 80

Number of combinations: 80


The grid variable represents all combinations of parameters, and we can iterate over it like a list.

In [9]:
# Iterate through the first five combinations of parameters.
# list(grid) converts the grid to a list so that it can be sliced.
for params_dict in list(grid)[:5]:
    print(params_dict)

{'knn__n_neighbors': 1, 'knn__p': 1, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 1, 'knn__p': 1, 'knn__weights': 'distance'}
{'knn__n_neighbors': 1, 'knn__p': 2, 'knn__weights': 'uniform'}
{'knn__n_neighbors': 1, 'knn__p': 2, 'knn__weights': 'distance'}
{'knn__n_neighbors': 2, 'knn__p': 1, 'knn__weights': 'uniform'}


Each iteration yields a dictionary with the current combination of values. We can now use it in the grid search:

In [10]:
# Create k-NN classifier
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Save accuracy on test set
test_scores = []

for params_dict in grid:
    # Set parameters
    pipe.set_params(**params_dict)

    # Fit a k-NN classifier
    pipe.fit(X_tr, y_tr)

    # Save accuracy on test set
    params_dict["accuracy"] = pipe.score(X_te, y_te)

    # Save result
    test_scores.append(params_dict)

The ** in pipe.set_params(**params_dict) is Python's kwargs (keyword arguments) unpacking syntax. In short, it passes each (keyword, value) pair of the dictionary to the function as a keyword argument.

In [11]:
# Value of params_dict for the first combination
params_dict = {"knn__n_neighbors": 1, "knn__p": 1, "knn__weights": "uniform"}

# Setting the parameters using the **kwargs syntax
pipe.set_params(**params_dict)

# .. is equivalent to
pipe.set_params(knn__n_neighbors=1, knn__p=1, knn__weights="uniform")

Pipeline(steps=[('scaler', StandardScaler()),
                ('knn', KNeighborsClassifier(n_neighbors=1, p=1))])

In [12]:
# Create DataFrame with test scores
scores_df = pd.DataFrame(test_scores)

# Top five scores
scores_df.sort_values(by="accuracy", ascending=False).head()

Unnamed: 0,knn__n_neighbors,knn__p,knn__weights,accuracy
28,8,1,uniform,0.813187
13,4,1,distance,0.813187
32,9,1,uniform,0.802198
12,4,1,uniform,0.802198
37,10,1,distance,0.802198


**Multiple Grids**

In [13]:
# Define two grids
grid = ParameterGrid(
    [
        {"knn__n_neighbors": [2, 3], "knn__p": [1, 2]},
        {"knn__weights": ["uniform", "distance"], "knn__p": [1, 2]},
    ]
)

# List combinations
list(grid)

[{'knn__n_neighbors': 2, 'knn__p': 1},
 {'knn__n_neighbors': 2, 'knn__p': 2},
 {'knn__n_neighbors': 3, 'knn__p': 1},
 {'knn__n_neighbors': 3, 'knn__p': 2},
 {'knn__p': 1, 'knn__weights': 'uniform'},
 {'knn__p': 1, 'knn__weights': 'distance'},
 {'knn__p': 2, 'knn__weights': 'uniform'},
 {'knn__p': 2, 'knn__weights': 'distance'}]

Each grid omits one of the hyperparameters. To make every combination explicit, we can set the missing hyperparameter to its default value:

In [14]:
# Define two grids
grid = ParameterGrid(
    [
        {
            "knn__n_neighbors": [2, 3],
            "knn__weights": ["uniform"],  # Default value: uniform
            "knn__p": [1, 2],
        },
        {
            "knn__n_neighbors": [5],  # Default value: 5
            "knn__weights": ["uniform", "distance"],
            "knn__p": [1, 2],
        },
    ]
)

# List combinations
list(grid)

[{'knn__n_neighbors': 2, 'knn__p': 1, 'knn__weights': 'uniform'},
 {'knn__n_neighbors': 2, 'knn__p': 2, 'knn__weights': 'uniform'},
 {'knn__n_neighbors': 3, 'knn__p': 1, 'knn__weights': 'uniform'},
 {'knn__n_neighbors': 3, 'knn__p': 2, 'knn__weights': 'uniform'},
 {'knn__n_neighbors': 5, 'knn__p': 1, 'knn__weights': 'uniform'},
 {'knn__n_neighbors': 5, 'knn__p': 1, 'knn__weights': 'distance'},
 {'knn__n_neighbors': 5, 'knn__p': 2, 'knn__weights': 'uniform'},
 {'knn__n_neighbors': 5, 'knn__p': 2, 'knn__weights': 'distance'}]
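An alternative to writing the defaults into each grid is to merge every partial combination into a dictionary of defaults. This can matter because set_params only changes the parameters it receives, so a pipeline reused across iterations keeps whatever value was set before. The sketch below is plain Python and makes assumptions: the defaults dictionary is written by hand to mirror the KNeighborsClassifier defaults (n_neighbors=5, weights='uniform', p=2), and expand is a hypothetical helper, not a scikit-learn function.

```python
from itertools import product

# Hand-written defaults (assumed to mirror the KNeighborsClassifier defaults)
defaults = {"knn__n_neighbors": 5, "knn__weights": "uniform", "knn__p": 2}

# The two partial grids from the first multiple-grid example
partial_grids = [
    {"knn__n_neighbors": [2, 3], "knn__p": [1, 2]},
    {"knn__weights": ["uniform", "distance"], "knn__p": [1, 2]},
]


def expand(grid):
    """Yield one dictionary per combination of values in a single grid."""
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Values from the grid override the defaults; missing keys fall back to them
completed = [
    {**defaults, **params} for grid in partial_grids for params in expand(grid)
]

for params in completed:
    print(params)
```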

**Optional Steps**

A pipeline step can be disabled by setting it to None. For example, to compare the pipeline with and without standardization:

In [15]:
# Grid with optional steps
grid = ParameterGrid(
    {
        "scaler": [None, StandardScaler()],
        "knn__n_neighbors": [5, 10, 15],
    }
)

# List combinations
list(grid)

[{'knn__n_neighbors': 5, 'scaler': None},
 {'knn__n_neighbors': 5, 'scaler': StandardScaler()},
 {'knn__n_neighbors': 10, 'scaler': None},
 {'knn__n_neighbors': 10, 'scaler': StandardScaler()},
 {'knn__n_neighbors': 15, 'scaler': None},
 {'knn__n_neighbors': 15, 'scaler': StandardScaler()}]
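Whether the scaler is kept matters for k-NN because features with large numeric ranges dominate the distance. The sketch below illustrates this with the five rows printed by data_df.head() earlier. It is a toy calculation only: standardizing over five rows is not meaningful in practice, and the helper names are hypothetical. It uses the population standard deviation, as StandardScaler does.

```python
import statistics

features = ["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]
rows = [
    [63, 145, 233, 150, 2.3, 0],
    [67, 160, 286, 108, 1.5, 3],
    [67, 120, 229, 129, 2.6, 2],
    [37, 130, 250, 187, 3.5, 0],
    [41, 130, 204, 172, 1.4, 0],
]


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(col) for col in cols]
    stds = [statistics.pstdev(col) for col in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def squared_distance_shares(a, b):
    """Fraction of the squared L2 distance contributed by each feature."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return [sq / total for sq in squares]


raw_shares = squared_distance_shares(rows[0], rows[1])
scaled = standardize(rows)
scaled_shares = squared_distance_shares(scaled[0], scaled[1])

for name, raw, std in zip(features, raw_shares, scaled_shares):
    print(f"{name:9s} raw={raw:.3f}  scaled={std:.3f}")
```

On the raw values, chol and thalach account for almost all of the distance between the first two patients, while oldpeak contributes next to nothing. After standardization every feature has a comparable influence.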