Cross-validation is a statistical method used in machine learning to evaluate the performance of a model. It involves partitioning the data into subsets, training the model on some subsets, and testing it on the remaining subsets. This helps assess how well the model will generalize to an independent dataset. The sections below cover several cross-validation techniques, along with their advantages, disadvantages, and Python code examples.

## K-Fold Cross-Validation
Description
The dataset is divided into k folds of approximately equal size. The model is trained k times, each time using k-1 folds for training and the remaining fold for testing.

Advantages
Reduces bias, because every data point appears in a test set exactly once and in a training set k-1 times.
Reduces variance, because the performance measure is averaged over k different train/test splits.

Disadvantages
Computationally expensive for large k values, since the model is refit k times.
Less reliable on very small datasets, where each test fold holds only a few samples and per-fold scores are noisy.

In [1]:
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data
X = np.array([[i] for i in range(10)])
y = np.array([2*i + 1 for i in range(10)])

kf = KFold(n_splits=5)
model = LinearRegression()
mse_scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse_scores.append(mean_squared_error(y_test, y_pred))

print(f'Mean MSE: {np.mean(mse_scores)}')


Mean MSE: 8.993014319519535e-30


kf.split(X) splits the data into 5 folds, yielding train_index and test_index for each iteration.
X_train and X_test are created by indexing X with train_index and test_index, respectively.
y_train and y_test are created by indexing y with train_index and test_index, respectively.
The model is trained using X_train and y_train.
Predictions (y_pred) are made using X_test.
The mean squared error between y_test and y_pred is calculated and appended to mse_scores.
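The manual loop described above can usually be condensed with scikit-learn's `cross_val_score` helper. The following is a minimal sketch rather than a replacement for the original cell; it reuses the same toy data and assumes the `neg_mean_squared_error` scorer, whose sign has to be flipped to recover the MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Same toy data as in the loop above: y = 2x + 1 with no noise
X = np.array([[i] for i in range(10)])
y = np.array([2 * i + 1 for i in range(10)])

kf = KFold(n_splits=5)

# scikit-learn scorers follow a "higher is better" convention,
# so MSE is exposed as its negative and is flipped back here.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
mse_scores = -neg_mse

print(f"Per-fold MSE: {mse_scores}")
print(f"Mean MSE: {mse_scores.mean()}")
```

Because the data is perfectly linear, the per-fold errors should be zero up to floating-point rounding, as in the explicit loop.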

In [3]:
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    print(f'Train Index: {train_index}')
    print(f'Test Index: {test_index}')
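    print(f'X_train: {X_train}')
    print(f'X_test: {X_test}')
    print(f'y_train: {y_train}')
    print(f'y_test: {y_test}')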

Train Index: [2 3 4 5 6 7 8 9]
Test Index: [0 1]
X_train: [[2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
X_test: [[0]
 [1]]
y_train: [ 5  7  9 11 13 15 17 19]
y_test: [1 3]
Train Index: [0 1 4 5 6 7 8 9]
Test Index: [2 3]
X_train: [[0]
 [1]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
X_test: [[2]
 [3]]
y_train: [ 1  3  9 11 13 15 17 19]
y_test: [5 7]
Train Index: [0 1 2 3 6 7 8 9]
Test Index: [4 5]
X_train: [[0]
 [1]
 [2]
 [3]
 [6]
 [7]
 [8]
 [9]]
X_test: [[4]
 [5]]
y_train: [ 1  3  5  7 13 15 17 19]
y_test: [ 9 11]
Train Index: [0 1 2 3 4 5 8 9]
Test Index: [6 7]
X_train: [[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [8]
 [9]]
X_test: [[6]
 [7]]
y_train: [ 1  3  5  7  9 11 17 19]
y_test: [13 15]
Train Index: [0 1 2 3 4 5 6 7]
Test Index: [8 9]
X_train: [[0]
 [1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]]
X_test: [[8]
 [9]]
y_train: [ 1  3  5  7  9 11 13 15]
y_test: [17 19]
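The output above shows that `KFold` with default settings assigns consecutive indices to each test fold. When the rows are sorted in some meaningful way, shuffling before splitting may give more representative folds. The sketch below assumes `shuffle=True` with an arbitrary seed (`random_state=0`, chosen here only for reproducibility) and checks that every sample still appears in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[i] for i in range(10)])

# shuffle=True permutes the sample indices before cutting them into folds;
# random_state=0 is an arbitrary seed, not a value from the original notebook.
kf_shuffled = KFold(n_splits=5, shuffle=True, random_state=0)

test_folds = [test_index for _, test_index in kf_shuffled.split(X)]
for fold in test_folds:
    print(f"Test Index: {fold}")

# Every sample still lands in exactly one test fold
all_test = np.sort(np.concatenate(test_folds))
print(f"All test indices: {all_test}")
```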


## Stratified K-Fold Cross-Validation
Description
Similar to K-Fold, but each fold keeps approximately the same proportion of class labels as the original dataset. This is particularly useful for imbalanced datasets.

Advantages
Maintains the distribution of classes in each fold.
Better for classification problems with imbalanced classes.

Disadvantages
Computationally expensive for large k values.
Not suitable for very small datasets, where a class may have fewer samples than there are folds.

In [6]:
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
X = np.array([[i] for i in range(10)])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # Two classes, 5 samples each (balanced in this toy example)

skf = StratifiedKFold(n_splits=5)
model = LogisticRegression()
accuracy_scores = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

print(f'Mean Accuracy: {np.mean(accuracy_scores)}')


Mean Accuracy: 0.6
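To see what stratification changes on a genuinely imbalanced label vector, the following sketch compares plain `KFold` with `StratifiedKFold`. The 15-to-5 class split is a hypothetical dataset made up for illustration, and `minority_counts` is a helper introduced here, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 15 samples of class 0, then 5 of class 1
y = np.array([0] * 15 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)


def minority_counts(splitter):
    """Return the number of class-1 samples in each test fold."""
    return [int(y[test_index].sum()) for _, test_index in splitter.split(X, y)]


plain = minority_counts(KFold(n_splits=5))
stratified = minority_counts(StratifiedKFold(n_splits=5))

print(f"KFold minority count per test fold:           {plain}")
print(f"StratifiedKFold minority count per test fold: {stratified}")
```

With these sorted labels, unshuffled `KFold` leaves three test folds with no minority samples at all, while `StratifiedKFold` places one minority sample in every test fold.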


## Leave-One-Out Cross-Validation (LOOCV)
Description
A special case of k-fold cross-validation where k equals the number of data points in the dataset. Each observation is used once as a single-sample test set while the remaining observations form the training set.

Advantages
Uses the maximum amount of data for training in each iteration.
Low bias, because each model is trained on all but one data point.

Disadvantages
Extremely computationally expensive for large datasets, since the model is refit once per data point.
High variance in the estimate, because the training sets are nearly identical and the fitted models are highly correlated.

In [7]:
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Sample data
X = np.array([[i] for i in range(10)])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

loo = LeaveOneOut()
model = DecisionTreeClassifier()
accuracy_scores = []

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

print(f'Mean Accuracy: {np.mean(accuracy_scores)}')


Mean Accuracy: 0.0
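The description of LOOCV as k-fold with k equal to the number of samples can be checked directly. The sketch below, under the assumption that `KFold` is used without shuffling, compares the test sets produced by `LeaveOneOut` with those from `KFold(n_splits=len(X))` on the same toy data.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.array([[i] for i in range(10)])

# Collect the test indices of every split as plain Python lists
loo_tests = [test_index.tolist() for _, test_index in LeaveOneOut().split(X)]
kfold_tests = [
    test_index.tolist() for _, test_index in KFold(n_splits=len(X)).split(X)
]

print(f"Number of LOOCV splits: {len(loo_tests)}")
print(f"Same test sets as KFold(n_splits={len(X)}): {loo_tests == kfold_tests}")
```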


## Time Series Split (Rolling Cross-Validation)
Description
Used for time series data, where the order of data points is important. The dataset is split so that each test set comes strictly after its training set in time. With default settings, scikit-learn's TimeSeriesSplit grows the training window with each split rather than sliding a fixed-size window.

Advantages
Preserves the temporal order of data points.
Useful for time series forecasting problems.

Disadvantages
Limited to time series (ordered) data.
Can be less stable if the time series has strong trends or seasonalities.

In [8]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

# Sample data
X = np.array([[i] for i in range(10)])
y = np.array([2*i + 1 for i in range(10)])

tscv = TimeSeriesSplit(n_splits=5)
model = SVR()
mae_scores = []

for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae_scores.append(mean_absolute_error(y_test, y_pred))

print(f'Mean MAE: {np.mean(mae_scores)}')


Mean MAE: 6.751543153635419
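Printing the indices makes the temporal structure of the splits visible. This sketch uses the same 10-sample toy data and `n_splits=5` as the cell above, and assumes the default `TimeSeriesSplit` settings (no `max_train_size`, no `gap`).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[i] for i in range(10)])
tscv = TimeSeriesSplit(n_splits=5)

# Materialize the splits as lists so they are easy to inspect
splits = [
    (train_index.tolist(), test_index.tolist())
    for train_index, test_index in tscv.split(X)
]

for train_index, test_index in splits:
    print(f"Train Index: {train_index}  Test Index: {test_index}")
```

Each training set is a prefix of the data that grows from one split to the next, and every test index is later than all of its training indices, so the model is never evaluated on observations that precede its training data.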


## Randomized Search Cross-Validation
Description
Combines cross-validation with random search for hyperparameter tuning. It randomly samples hyperparameter combinations and evaluates each one using cross-validation.

Advantages
Efficient for hyperparameter tuning, since the number of sampled combinations is fixed in advance.
Can handle a large search space.

Disadvantages
May require a large number of iterations to be effective.
Computationally expensive for complex models.

In [9]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Sample data
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

param_dist = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

model = RandomForestClassifier()
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(X, y)

print(f'Best Parameters: {random_search.best_params_}')
print(f'Best Score: {random_search.best_score_}')


Best Parameters: {'n_estimators': 50, 'min_samples_split': 5, 'max_depth': 10}
Best Score: 0.63
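`RandomizedSearchCV` also accepts distributions in place of fixed lists, which is one way to cover a large search space. The sketch below is an illustrative variant, not the original cell: it assumes `scipy.stats.randint` for the integer ranges and fixes seeds (`random_state=0`, an arbitrary choice) so the run is repeatable. The labels are random noise, so scores near 0.5 are to be expected.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Random features and random binary labels, seeded for repeatability
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.integers(0, 2, size=100)

param_dist = {
    "n_estimators": randint(10, 101),     # integers from 10 to 100
    "max_depth": [None, 10, 20, 30],      # lists are sampled uniformly
    "min_samples_split": randint(2, 11),  # integers from 2 to 10
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,        # number of sampled hyperparameter combinations
    cv=3,            # folds used to score each combination
    random_state=0,  # seed for the hyperparameter sampling
)
search.fit(X, y)

print(f"Best Parameters: {search.best_params_}")
print(f"Best Score: {search.best_score_}")
```

The total number of model fits is `n_iter * cv` plus one final refit on the full data, which is what drives the computational cost noted in the disadvantages.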
