## Cross-validation
**Cross-validation** is a statistical method for estimating how well a machine learning model will perform on data it has not seen. The model is repeatedly evaluated on data that was not used to fit it, which helps flag problems such as overfitting or selection bias and indicates how well the model will generalize to an independent dataset.

The basic form of cross-validation is k-fold cross-validation. Here is how it works:

1. **Split the dataset into k subsets**: The data is divided into k equally (or nearly equally) sized subsets.
2. **For each subset**:
    - Use the subset as the validation set.
    - Use the remaining k-1 subsets as the training set.
    - Train the model on the training set and evaluate it on the validation set.
3. **Aggregate the results**: The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
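
The three steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names `k_fold_indices` and `cross_validate`, the toy "predict the training mean" model, and the sample data are all assumptions made for the example.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    # The first n_samples % k folds receive one extra sample.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(xs, ys, fit, score, k=5):
    """Return the per-fold scores and their average."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in val_idx],
                            [ys[i] for i in val_idx]))
    return scores, mean(scores)


# Toy model (hypothetical): always predict the mean of the training targets.
def fit_mean(x_train, y_train):
    return mean(y_train)


def mse(model, x_val, y_val):
    return mean((y - model) ** 2 for y in y_val)


xs = list(range(10))
ys = [2.0 * x for x in xs]
fold_scores, avg_score = cross_validate(xs, ys, fit_mean, mse, k=5)
print(len(fold_scores), avg_score)
```

Any pair of `fit` and `score` callables with the same shape can be passed in, so the loop itself stays independent of the model being evaluated.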

Other forms of cross-validation include:
- **Leave-One-Out Cross-Validation (LOOCV)**: A special case of k-fold cross-validation where k is equal to the number of data points in the dataset.
- **Stratified k-Fold Cross-Validation**: Ensures that each fold is representative of all classes in the data, which is particularly useful for imbalanced datasets.
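- **Time Series Cross-Validation**: Used for time series data, where the order of data points matters. Each validation set comes after the data used for training, so the model is never evaluated on the past using information from the future.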
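
For the time series variant, one common scheme is an expanding window: the training set grows with each split and the validation set is always the block that follows it. The sketch below is a hand-rolled illustration under that assumption; the function name `time_series_splits` is hypothetical, and library implementations may distribute leftover samples differently.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_idx, val_idx) with training data strictly before validation data."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        # The last validation block absorbs any leftover samples.
        val_end = n_samples if i == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))


for train_idx, val_idx in time_series_splits(n_samples=10, n_splits=3):
    print(train_idx, val_idx)
```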

Cross-validation helps in selecting the best model and tuning hyperparameters because it gives a more reliable estimate of model performance than a single train/test split.

``It's a technique for validating model performance by training the model on a subset of the input data and testing it on a previously unseen subset of the input data.``

![Image](https://github.com/user-attachments/assets/973db0e2-6d02-43e9-a2d2-bc72659bfb8f)

![CV](https://miro.medium.com/v2/resize:fit:720/format:webp/1*KgFHBaLJQGY-VYcKA3LFLg.png)

![Image](https://github.com/user-attachments/assets/4dd081b4-1918-4535-9012-3b9cc479aea1)

## **K-fold cross-validation** 
**K-fold cross-validation** is a technique to evaluate the performance of a model by dividing the dataset into k equal (or nearly equal) subsets. The model is trained on k-1 subsets and validated on the remaining subset, and this is repeated k times so that each subset serves as the validation set exactly once.

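``It's not well suited to imbalanced datasets``: a random split can leave some folds with very few or no samples of a minority class. The figures below illustrate the splits for k=5.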


![K-fold](https://i0.wp.com/spotintelligence.com/wp-content/uploads/2023/07/k-fold-cross-validation-1024x576.webp?resize=1024%2C576&ssl=1)

![K-fold](https://cdn.analyticsvidhya.com/wp-content/uploads/2024/09/37094K-fold-cross-vaslidation.png)
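
To see why plain k-fold can struggle with imbalanced data, consider the worst case where the data is ordered by class and split without shuffling. The sketch below uses made-up labels (95 majority, 5 minority) and a hypothetical helper `contiguous_folds`; it counts the minority samples that land in each of 5 folds.

```python
from collections import Counter


def contiguous_folds(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous, nearly equal folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Hypothetical data: 95 majority samples followed by 5 minority samples.
labels = ["majority"] * 95 + ["minority"] * 5
minority_per_fold = [
    Counter(labels[i] for i in fold)["minority"]
    for fold in contiguous_folds(len(labels), k=5)
]
print(minority_per_fold)  # [0, 0, 0, 0, 5]
```

Four of the five validation folds contain no minority samples at all. Shuffling before splitting makes this extreme outcome less likely, but it still does not control the class mix of each fold.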

## **Stratified k-Fold Cross-Validation**

Stratified k-Fold Cross-Validation is a variation of k-fold cross-validation that ensures each fold is representative of all classes in the data. This is particularly useful for imbalanced datasets where some classes are underrepresented. The process is as follows:

1. **Split the dataset into k subsets**: The data is divided into k equally (or nearly equally) sized subsets, ensuring that each subset has the same proportion of each class as the original dataset.
2. **For each subset**:
    - Use the subset as the validation set.
    - Use the remaining k-1 subsets as the training set.
    - Train the model on the training set and evaluate it on the validation set.
3. **Aggregate the results**: The performance measure reported by stratified k-fold cross-validation is then the average of the values computed in the loop.

This method helps in providing a more accurate estimate of model performance, especially when dealing with imbalanced datasets.
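
One simple way to build stratified folds is to group sample indices by class and deal each class out to the folds in round-robin order. The sketch below assumes that approach; `stratified_k_fold` is a hypothetical helper name and the 90/10 label split is made-up data.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Return k lists of indices whose class proportions mirror `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # randomize which samples go to which fold
        # Deal indices round-robin, continuing where the previous class
        # stopped so that overall fold sizes stay balanced.
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


labels = ["neg"] * 90 + ["pos"] * 10
folds = stratified_k_fold(labels, k=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

With these labels every fold receives 18 `neg` and 2 `pos` samples, matching the 90:10 ratio of the full dataset. Each fold then serves once as the validation set, exactly as in plain k-fold.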

## When to Use
- When working with datasets that have an unbalanced distribution of classes.
- When plain random shuffling and splitting could leave some folds with too few samples of a class.
- When you want each fold to reflect the class distribution of the full dataset.

## Benefits
- Ensures that each fold of the dataset contains approximately the same percentage of samples of each class as the complete set.
- Reduces the risk that a fold with a skewed class mix distorts the performance estimate.
- Provides a more robust and reliable estimate of model performance.

## How It Works
- Preserves the original class distribution in each fold.
- Trains and tests the model on a representative sample of each class in every iteration, provided each class has at least k samples.


![](https://dataaspirant.com/wp-content/uploads/2020/12/8-Stratified-K-Fold-Cross-Validation-768x516.png)

## Leave-One-Out Cross-Validation (LOOCV)

![](https://dataaspirant.com/wp-content/uploads/2023/10/3-3-768x811.png)

Leave-One-Out Cross-Validation (LOOCV) is a resampling procedure used to evaluate machine learning models on a limited data sample. Each data point is held out once as a single-sample validation set while the model is trained on the remaining n-1 points, so a dataset of n points requires n training runs.

### Step 1: Data Preparation

- **Dataset Isolation**: Isolate your dataset, ensuring it is cleansed and pre-processed, ready for model evaluation.
- **Data Segregation**: Identify individual data points; each will serve as a validation set in its turn.

### Step 2: Iterative Model Training and Validation

1. **Iteration Initiation**: Begin with the first data point as the validation set and the remainder as the training set.
2. **Model Training**: Employ the training set to train your model, fine-tuning as per algorithm-specific parameters.
3. **Validation Assessment**: Utilize the isolated data point to validate the model, recording the error metric or model prediction.
4. **Iteration Continuation**: Progress to the next data point, reallocating the training and validation sets accordingly, and repeat the training and validation process.

### Step 3: Error Aggregation

- **Error Calculation**: For each iteration, compute and store the error metric. Because each validation set holds a single point, this is typically the squared error for regression or a 0/1 correctness indicator for classification.
- **Aggregate Error**: Once all iterations are complete, average the recorded values to obtain an overall performance estimate (Mean Squared Error for regression, Accuracy for classification).
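
Steps 1 to 3 can be sketched for a small regression problem. The example below is an illustration only: it assumes a one-dimensional least-squares line as the model, and the helper names `fit_line` and `loocv_mse` as well as the data values are made up.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on 1-D data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_bar - a * x_bar


def loocv_mse(xs, ys):
    """Hold out each point once, refit, and average the squared errors."""
    errors = []
    for i in range(len(xs)):
        # Training set: every point except the i-th one.
        x_train = xs[:i] + xs[i + 1:]
        y_train = ys[:i] + ys[i + 1:]
        a, b = fit_line(x_train, y_train)
        prediction = a * xs[i] + b
        errors.append((ys[i] - prediction) ** 2)
    return mean(errors)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]
print(loocv_mse(xs, ys))
```

The model is refit from scratch in every iteration, which is why LOOCV becomes expensive for large datasets or slow-to-train models.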

### Step 4: Model Evaluation

- **Performance Insight**: The aggregated error provides insight into the model’s predictive capability and generalization to unseen data.
- **Model Comparison**: Use the aggregated error to compare the effectiveness of different models or model parameters.

### Step 5: Final Model Training

- **Comprehensive Training**: Once model selection and tuning are complete, utilize the entire dataset to train the final model.
- **Real-world Application**: Implement the fully trained model to make predictions on new, unseen data.

### Step 6: Review and Reflection

- **Model Review**: Reflect on the model’s performance and consider whether alternative approaches or additional tuning is warranted.
- **Practical Implication**: Consider the practical implications of the model, ensuring it aligns with the problem context and project objectives.


## Leave-p-out cross-validation
**Leave-p-out cross-validation** uses a subset of p observations as the validation set and the remaining n-p observations as the training set. The process is repeated for every possible combination of p observations, so a dataset of n observations requires C(n, p) = n! / (p! (n-p)!) training runs. This number grows very quickly with n and p, which makes the method expensive for all but small datasets. LOOCV is the special case p = 1.

![](https://dataaspirant.com/wp-content/uploads/2023/10/1-3-768x467.png)
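
The enumeration of every size-p validation set can be sketched with `itertools.combinations` from the Python standard library. The generator name `leave_p_out` is a hypothetical helper chosen for this example.

```python
from itertools import combinations
from math import comb


def leave_p_out(n_samples, p):
    """Yield (train_idx, val_idx) for every size-p validation set."""
    all_idx = range(n_samples)
    for val_idx in combinations(all_idx, p):
        held_out = set(val_idx)
        train_idx = [i for i in all_idx if i not in held_out]
        yield train_idx, list(val_idx)


splits = list(leave_p_out(n_samples=5, p=2))
print(len(splits), comb(5, 2))  # 10 10
```

Even for this tiny dataset there are 10 splits; with 100 samples and p = 2 there would already be 4950, which illustrates how quickly the cost grows.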