# Hyperparameters
- Linear regression: Choosing parameters
- Ridge/Lasso regression: Choosing alpha
- k-Nearest Neighbors: Choosing n_neighbors
- Hyperparameters cannot be learned by fitting the model

# Choosing the correct hyperparameter
- Try a bunch of different hyperparameter values
- Fit all of them separately
- See how well each performs
- Choose the best performing one
- It is essential to use cross-validation, so that the chosen value does not simply overfit a single train/test split
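
The steps above can be sketched as a plain loop. This is a minimal illustration, not a definitive implementation: the data is a synthetic stand-in from `make_classification`, and the names `X_demo`, `y_demo`, `candidate_k`, and `mean_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the dataset used later on)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidate_k = [1, 3, 5, 7, 9]   # hyperparameter values to try
mean_scores = {}

for k in candidate_k:
    knn = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: fit and score on 5 different train/validation splits
    scores = cross_val_score(knn, X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()

# Choose the value with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("Best n_neighbors:", best_k)
```

`GridSearchCV`, used further down, automates this same loop.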

# Grid search cross-validation
<table>
    <tr><td>C \ Alpha</td><td>0.1</td><td>0.2</td><td>0.3</td><td>0.4</td></tr>
    <tr><td>0.5</td><td>0.701</td><td>0.703</td><td>0.697</td><td>0.696</td></tr>
    <tr><td>0.4</td><td>0.699</td><td>0.702</td><td>0.698</td><td>0.702</td></tr>
    <tr><td>0.3</td><td>0.721</td><td>0.726</td><td>0.713</td><td>0.703</td></tr>
    <tr><td>0.2</td><td>0.706</td><td>0.705</td><td>0.704</td><td>0.701</td></tr>
    <tr><td>0.1</td><td>0.698</td><td>0.692</td><td>0.688</td><td>0.675</td></tr>
</table>
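
Each cell of the table presumably holds the cross-validated score for one (C, alpha) pair. As a rough sketch of how such a grid is enumerated, scikit-learn's `ParameterGrid` lists every combination. The keys `'C'` and `'alpha'` are taken from the table labels and are not tied to a particular estimator here.

```python
from sklearn.model_selection import ParameterGrid

# Values taken from the row and column labels of the table above
grid = ParameterGrid({
    'C': [0.1, 0.2, 0.3, 0.4, 0.5],
    'alpha': [0.1, 0.2, 0.3, 0.4],
})

combos = list(grid)
print(len(combos))   # 5 values of C x 4 values of alpha
print(combos[0])
```

With 5-fold cross-validation, each of these combinations is fit 5 times, so the cost grows quickly as more hyperparameters or values are added.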

### Hyperparameter for Logistic Regression
- **C**: the inverse of the regularization strength (smaller values mean stronger regularization)
- large C can lead to an *overfit* model
- small C can lead to an *underfit* model
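
One way to see the effect of C is to compare the size of the learned coefficients under different values. The sketch below assumes the default L2 penalty and uses synthetic stand-in data. The names `X_demo`, `y_demo`, and `coef_norms` are hypothetical, and `max_iter=5000` is only there to let the solver converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

coef_norms = {}
for C in [1e-4, 1.0, 1e4]:
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_demo, y_demo)
    # Size of the learned coefficient vector under this regularization strength
    coef_norms[C] = np.linalg.norm(model.coef_)

print(coef_norms)
```

A very small C shrinks the coefficients toward zero (a simpler model that may underfit), while a very large C leaves them almost unconstrained (a more flexible model that may overfit).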

In [4]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [2]:
df = pd.read_csv('datasets/diabetes.csv')

X = df.drop('diabetes', axis=1)
y = df['diabetes']

In [18]:
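# Set up the hyperparameter grid: 15 values of C, log-spaced from 1e-5 to 1e8
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Model whose hyperparameter C is being tuned
logreg = LogisticRegression()

# Evaluate every value in the grid with 5-fold cross-validation
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Run the search; the best setting is refit on all of X and y
logreg_cv.fit(X, y)

print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))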

Tuned Logistic Regression Parameters: {'C': 268.2695795279727}
Best score is 0.7708333333333334
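
To see where the best score comes from, the per-candidate results can be read from `cv_results_`. This is a sketch on synthetic stand-in data with a smaller grid than the one above. The names `X_demo`, `y_demo`, and `search` are hypothetical, and `max_iter=5000` is an addition to help convergence.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; swap in your own X and y)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    {'C': np.logspace(-3, 3, 7)},
    cv=5,
)
search.fit(X_demo, y_demo)

# One mean cross-validated score per candidate value of C
results = search.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print("C={:g}: mean CV accuracy {:.3f}".format(params['C'], score))

# best_score_ is the highest of those mean scores
print(search.best_params_, search.best_score_)
```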


## Randomized Search Cross-validation
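Instead of trying every combination in a grid, randomized search samples a fixed number of hyperparameter settings from the given lists and distributions. This saves computation cost when the search space is large.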

In [23]:
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

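# Set up the parameters and distributions to sample from
param_dist = {
    'max_depth': [3, None],
    'max_features': randint(1, 9),        # integers from 1 to 8 (upper bound excluded)
    'min_samples_leaf': randint(1, 9),    # integers from 1 to 8 (upper bound excluded)
    'criterion': ['gini', 'entropy'],
}

tree = DecisionTreeClassifier()
# Sample n_iter=10 settings (the default) and score each with 5-fold cross-validation
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)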

tree_cv.fit(X, y)

print('Tuned Decision Tree Parameters:\n{}'.format(tree_cv.best_params_))
print('Best score is {}'.format(tree_cv.best_score_))

Tuned Decision Tree Parameters:
{'criterion': 'entropy', 'max_depth': 3, 'max_features': 6, 'min_samples_leaf': 7}
Best score is 0.7486979166666666
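
To get a feel for the savings, the sketch below counts the candidates an exhaustive grid over the same search space would try and compares that with the number a randomized search draws. `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers for enumerating and sampling settings. The names `full_grid` and `sampled`, and the fixed `random_state`, are additions for this illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as above, with each randint range written out as a list
full_grid = ParameterGrid({
    'max_depth': [3, None],
    'max_features': list(range(1, 9)),      # randint(1, 9) draws from 1..8
    'min_samples_leaf': list(range(1, 9)),
    'criterion': ['gini', 'entropy'],
})
print(len(full_grid))   # every combination an exhaustive grid search would try

# Randomized search draws only n_iter settings from the distributions
sampler = ParameterSampler(
    {
        'max_depth': [3, None],
        'max_features': randint(1, 9),
        'min_samples_leaf': randint(1, 9),
        'criterion': ['gini', 'entropy'],
    },
    n_iter=10,          # the default n_iter of RandomizedSearchCV
    random_state=0,     # fixed seed so the draw is repeatable (an addition for this sketch)
)
sampled = list(sampler)
print(len(sampled))
```

The trade-off is that a randomized search may miss the best combination. Raising `n_iter` buys more coverage at a higher cost.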
