### Cross-Validation Techniques with Hyperparameter Tuning

Cross-validation is a resampling method for estimating how well a machine learning model will perform on data it has not seen. Hyperparameter tuning is the search for the hyperparameter values (settings fixed before training, such as a regularization strength) that give the best performance. Running the search inside cross-validation means every candidate is scored on data it was not trained on, which reduces the risk of choosing hyperparameters that merely overfit one particular split.

#### Common Cross-Validation Techniques

1. **Holdout Validation:**
   - Split the dataset into two parts: a training set and a testing set.
   - Train the model on the training set and evaluate it on the testing set.
   - Simple and cheap, but the estimate can change noticeably from one split to the next because it depends on a single split.

2. **K-Fold Cross-Validation:**
   - Split the dataset into K folds of approximately equal size.
   - Train the model on K-1 folds and validate it on the remaining fold.
   - Repeat this process K times, so that each fold serves as the validation set exactly once.
   - Average the performance across all K iterations.
   - More reliable than holdout because the estimate is averaged over multiple splits, at the cost of training the model K times.
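
As a rough illustration of the difference between the two techniques, the sketch below scores the same model on several holdout splits and on one 5-fold run. The synthetic data, the choice of `Ridge`, and names such as `X_demo` and `holdout_mse` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Hypothetical synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 5)
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + X_demo[:, 2] + rng.randn(100)

# Holdout: each random split yields a single MSE estimate
holdout_mse = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    fitted = Ridge().fit(X_tr, y_tr)
    holdout_mse.append(mean_squared_error(y_te, fitted.predict(X_te)))

# K-fold: every sample is used for validation exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_mse = -cross_val_score(
    Ridge(), X_demo, y_demo, cv=kfold, scoring='neg_mean_squared_error'
)

print('Holdout MSE per split:', np.round(holdout_mse, 3))
print('5-fold MSE per fold:  ', np.round(kfold_mse, 3))
print('5-fold mean MSE:      ', round(kfold_mse.mean(), 3))
```

The spread of the holdout values gives a feel for how much a single split can mislead, while the K-fold mean summarizes all five validation folds in one number.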

#### Hyperparameter Tuning

Hyperparameter tuning is the process of optimizing the hyperparameters of a machine learning model to achieve the best performance. Common methods include:

1. **Grid Search:**
   - Define a grid of possible hyperparameter values.
   - Evaluate the model for each combination of hyperparameters using cross-validation.
   - Select the combination that yields the best performance.

2. **Random Search:**
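   - Randomly sample a fixed number of hyperparameter combinations from specified distributions (or lists of values).
   - Evaluate the model for each sampled combination using cross-validation.
   - Often more efficient than grid search for large hyperparameter spaces, because the number of evaluations is set in advance instead of growing with every value added to the grid.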
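
To make the two procedures concrete, here is a minimal hand-rolled sketch that uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The search space, the synthetic data, and the helper `cv_mse` are hypothetical choices for illustration; in practice `GridSearchCV` and `RandomizedSearchCV` wrap these steps.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid, ParameterSampler, cross_val_score

# Hypothetical synthetic regression data, for illustration only
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 5)
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + X_demo[:, 2] + rng.randn(100)

# Assumed search space: 3 x 2 = 6 combinations
space = {'alpha': [0.1, 1.0, 10.0], 'fit_intercept': [True, False]}


def cv_mse(params):
    """Mean 5-fold cross-validated MSE for one hyperparameter combination."""
    scores = cross_val_score(
        Ridge(**params), X_demo, y_demo, cv=5, scoring='neg_mean_squared_error'
    )
    return -scores.mean()


# Grid search: evaluate every combination in the grid
grid_candidates = list(ParameterGrid(space))
grid_best = min(grid_candidates, key=cv_mse)

# Random search: evaluate only a fixed number of sampled combinations
random_candidates = list(ParameterSampler(space, n_iter=3, random_state=0))
random_best = min(random_candidates, key=cv_mse)

print(f'Grid search tried {len(grid_candidates)} candidates, best: {grid_best}')
print(f'Random search tried {len(random_candidates)} candidates, best: {random_best}')
```

Random search evaluates half as many candidates here, so it may or may not land on the same combination as the exhaustive grid.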

### Practical Implementation

Let's see how to combine cross-validation with hyperparameter tuning using scikit-learn.

1. **Import Necessary Libraries:**

In [14]:
import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

2. **Load Dataset:**

For simplicity, we'll use a synthetic dataset.

In [15]:
# Generating a synthetic dataset
np.random.seed(0)
X = np.random.rand(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + np.random.randn(100)

In [16]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

3. **K-Fold Cross-Validation:**

In [17]:
# Define K-Fold cross-validator
kf = KFold(n_splits=5, shuffle=True, random_state=0)

In [18]:
# Example model: Ridge Regression
model = Ridge()

In [19]:
# Cross-validation scores
scores = []

for train_index, val_index in kf.split(X_train):
    # Split the training data into this fold's training and validation parts
    X_train_k, X_val_k = X_train[train_index], X_train[val_index]
    y_train_k, y_val_k = y_train[train_index], y_train[val_index]

    # Fit on K-1 folds, then score on the held-out fold
    model.fit(X_train_k, y_train_k)
    y_pred_k = model.predict(X_val_k)
    score = mean_squared_error(y_val_k, y_pred_k)
    scores.append(score)

print(f'Cross-Validation MSE scores: {scores}')
print(f'Mean Cross-Validation MSE: {np.mean(scores)}')

Cross-Validation MSE scores: [1.123026250858799, 1.310962679762743, 0.9738349081283368, 1.0792600611839485, 1.4403163291786139]
Mean Cross-Validation MSE: 1.1854800458224883
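
The manual loop above can usually be replaced by `cross_val_score`. The sketch below rebuilds the same synthetic data and split so that it runs on its own. The name `cv_scores` is an assumption of this example, and the scores should match the loop as long as the same `KFold` object is used.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Same synthetic data and train/test split as in the cells above
np.random.seed(0)
X = np.random.rand(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn scorers follow a "higher is better" convention,
# so MSE is returned negated and the sign is flipped back here
cv_scores = -cross_val_score(
    Ridge(), X_train, y_train, cv=kf, scoring='neg_mean_squared_error'
)

print(f'Cross-Validation MSE scores: {cv_scores}')
print(f'Mean Cross-Validation MSE: {cv_scores.mean()}')
```

`cross_val_score` fits a fresh clone of the estimator on each fold, which avoids accidentally carrying state from one fold to the next.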


4. **Hyperparameter Tuning with Grid Search:**

In [47]:
# Define the parameter grid
param_grid = {
    'alpha': [0.1, 1.0, 10.0, 100.0]
}

In [58]:
# Initialize GridSearchCV, scoring each candidate by mean absolute error (MAE).
# Other regression scorers include 'r2', 'neg_root_mean_squared_error',
# 'neg_mean_squared_error', and 'neg_mean_squared_log_error'.
grid_search = GridSearchCV(Ridge(), param_grid, cv=5, scoring='neg_mean_absolute_error')

In [59]:
# Fit GridSearchCV
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=Ridge(),
             param_grid={'alpha': [0.1, 1.0, 10.0, 100.0]},
             scoring='neg_mean_absolute_error')

In [60]:
# Best hyperparameters and corresponding score
# (the search uses scoring='neg_mean_absolute_error', so the negated score is an MAE)
print(f'Best hyperparameters: {grid_search.best_params_}')
print(f'Best cross-validation MAE: {-grid_search.best_score_}')

Best hyperparameters: {'alpha': 0.1}
Best cross-validation MAE: 0.8908434954079025


In [61]:
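# Evaluate the best model on the test set
# Note: the search was scored with MAE, so this test MSE is not directly
# comparable to the cross-validation score printed above.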
best_model = grid_search.best_estimator_
y_pred_test = best_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred_test)
print(f'Test MSE: {test_mse}')

Test MSE: 1.1460291479036144
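
Because the grid search is scored with MAE, a like-for-like check reports MAE on the test set as well. The sketch below rebuilds the same data and search so it runs standalone, lists the cross-validated MAE of every candidate via `cv_results_`, and computes the test MAE. The variable `test_mae` is an assumption of this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Same synthetic data and train/test split as in the cells above
np.random.seed(0)
X = np.random.rand(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

grid_search = GridSearchCV(
    Ridge(),
    {'alpha': [0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring='neg_mean_absolute_error',
)
grid_search.fit(X_train, y_train)

# Mean cross-validated MAE for every candidate in the grid
results = grid_search.cv_results_
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(f'{params}: CV MAE = {-mean_score:.3f}')

# With the default refit=True, predict() uses the best estimator
test_mae = mean_absolute_error(y_test, grid_search.predict(X_test))
print(f'Best cross-validation MAE: {-grid_search.best_score_:.3f}')
print(f'Test MAE: {test_mae:.3f}')
```

Comparing the two MAE values gives a rough sense of whether the cross-validation estimate was optimistic for this particular split.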


5. **Hyperparameter Tuning with Random Search:**

In [62]:
# Define the parameter distribution
# (a plain list of values is sampled uniformly, without replacement)
param_dist = {
    'alpha': [0.1, 1.0, 10.0, 100.0]
}

In [70]:
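# Initialize RandomizedSearchCV
# Note: the list above holds only 4 candidates, fewer than n_iter=10, so
# scikit-learn warns and evaluates all 4; the search then matches the grid search.
random_search = RandomizedSearchCV(Ridge(), param_dist, n_iter=10, cv=5, scoring='neg_mean_absolute_error', random_state=0)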

In [71]:
# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)



RandomizedSearchCV(cv=5, estimator=Ridge(),
                   param_distributions={'alpha': [0.1, 1.0, 10.0, 100.0]},
                   random_state=0, scoring='neg_mean_absolute_error')

In [74]:
# Best hyperparameters and corresponding score
# (the search uses scoring='neg_mean_absolute_error', so the negated score is an MAE)
print(f'Best hyperparameters: {random_search.best_params_}')
print(f'Best cross-validation MAE: {-random_search.best_score_}')

Best hyperparameters: {'alpha': 0.1}
Best cross-validation MAE: 0.8908434954079025


In [73]:
# Evaluate the best model on the test set
best_model_random = random_search.best_estimator_
y_pred_test_random = best_model_random.predict(X_test)
test_mse_random = mean_squared_error(y_test, y_pred_test_random)
print(f'Test MSE: {test_mse_random}')

Test MSE: 1.1460291479036144
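
Random search tends to be more useful when values are drawn from a continuous distribution instead of a short list. The sketch below is one possible variant, assuming a log-uniform prior over `alpha` between 0.01 and 100 via `scipy.stats.loguniform`. The range and the names `param_dist_continuous` and `search` are illustrative choices, not part of the original example.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Same synthetic data and train/test split as in the cells above
np.random.seed(0)
X = np.random.rand(100, 5)
y = 3 * X[:, 0] + 2 * X[:, 1] + X[:, 2] + np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed prior: alpha spread evenly across orders of magnitude
param_dist_continuous = {'alpha': loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    Ridge(),
    param_dist_continuous,
    n_iter=10,  # ten independent draws from the distribution
    cv=5,
    scoring='neg_mean_absolute_error',
    random_state=0,
)
search.fit(X_train, y_train)

print(f'Best hyperparameters: {search.best_params_}')
print(f'Best cross-validation MAE: {-search.best_score_}')
```

With a continuous distribution, `n_iter` directly controls the search budget, and each draw can explore a value that a fixed grid would never contain.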


### Summary

Combining cross-validation with hyperparameter tuning helps you select a configuration that generalizes well, not one that merely fits a single split. K-fold cross-validation gives a more stable estimate of model performance than a single holdout split, while grid search and random search are two practical ways to explore the hyperparameter space. Keeping a separate test set, as in the examples above, provides a final check on data that played no part in the search.