**Q. What is cross-validation, and why is it important in model evaluation?**

**Ans.**

**Definition of cross-validation:**

Cross-validation is a resampling technique for estimating how well a machine learning model will perform on unseen data. In its most common form, k-fold cross-validation, the available data is divided into k folds (subsets); the model is trained on k-1 folds and evaluated on the remaining fold. This is repeated k times so that each fold serves as the evaluation set exactly once, and the k scores are then averaged.
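To make the fold rotation concrete, here is a minimal sketch of how k-fold splits can be generated by hand. The helper name `k_fold_indices` and its parameters are hypothetical, introduced only for illustration; in practice a library splitter such as scikit-learn's `KFold` would normally be used.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Hypothetical helper for illustration; not part of any library.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so folds are random
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]                  # held-out fold
        train_idx = indices[:start] + indices[start + size:]   # all other folds
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train on {len(train_idx)} samples, validate on {sorted(val_idx)}")
```

Every sample lands in exactly one validation fold, which is why each observation contributes to the final averaged score once.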

**Importance of cross-validation in model evaluation:**

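- **Detects Overfitting:** Every score is computed on data the model did not see during training, so a model that has merely memorized the training set is exposed by a low validation score. Cross-validation reveals overfitting; it does not prevent it by itself.
- **Model Selection:** Helps choose the best model or hyperparameters by comparing average performance across folds.
- **Performance Estimation:** Provides a more reliable estimate of model performance than a single train-test split, because the score is averaged over several splits and its spread across folds can be reported.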
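The hyperparameter-selection use mentioned above can be sketched with scikit-learn's `GridSearchCV`, which runs cross-validation for every parameter combination and keeps the one with the best mean score. The synthetic data and the small parameter grid below are assumptions chosen to keep the example fast; they are not the data or settings used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption; not the dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid, kept small so the search finishes quickly
param_grid = {"n_estimators": [10, 30], "max_depth": [2, None]}

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv_demo,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)  # runs 5-fold CV for each of the 4 combinations

print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.4f}")
```

Because the winning combination is picked using the cross-validation scores, `best_score_` is slightly optimistic; a separate held-out test set is still needed for the final estimate.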

### Load the Dataset

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [7]:
# Load the Wine Quality dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(url, delimiter=";")

# Separate features and target
X = df.drop("quality", axis=1)
y = (df["quality"] >= 6).astype(int)  # Binary classification: Good (>=6) vs Bad (<6)

# Hold out 20% as a final test set; cross-validation below uses only the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Apply Cross-Validation
Perform 5-fold stratified cross-validation on the training set for several models to compare their performance.

In [8]:
from sklearn.model_selection import StratifiedKFold

# Define models
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42)
}

# Stratified k-fold keeps the class ratio similar in every fold; the fixed random_state makes the splits reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate models using cross-validation
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name}: Mean Accuracy = {scores.mean():.4f}, Std Dev = {scores.std():.4f}")

Logistic Regression: Mean Accuracy = 0.7443, Std Dev = 0.0195
Random Forest: Mean Accuracy = 0.8030, Std Dev = 0.0190
SVM: Mean Accuracy = 0.6286, Std Dev = 0.0309
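
One plausible reason for the low SVM score is that the features are on very different scales, which kernel SVMs are sensitive to; this is an assumption, not something verified on the data above. The sketch below shows how scaling can be added inside the cross-validation loop with a pipeline. It uses scikit-learn's bundled `load_wine` dataset as a stand-in so that it runs offline; that is a different dataset from the Wine Quality CSV, so the numbers are not comparable to the output above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled wine dataset (NOT the Wine Quality CSV)
X_wine, y_wine = load_wine(return_X_y=True)

cv_wine = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The scaler lives inside the pipeline, so it is refit on the training folds only
# and never sees the validation fold (avoids data leakage)
svm_raw = SVC()
svm_scaled = make_pipeline(StandardScaler(), SVC())

raw_scores = cross_val_score(svm_raw, X_wine, y_wine, cv=cv_wine, scoring="accuracy")
scaled_scores = cross_val_score(svm_scaled, X_wine, y_wine, cv=cv_wine, scoring="accuracy")

print(f"SVM (raw features):    Mean Accuracy = {raw_scores.mean():.4f}")
print(f"SVM (scaled features): Mean Accuracy = {scaled_scores.mean():.4f}")
```

The same pattern should carry over to the notebook's data by passing `X_train` and `y_train` to `cross_val_score`; the resulting scores would need to be checked by running it.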


### Train and Evaluate the Best Model
Choose the best model based on cross-validation performance, refit it on the full training set, and evaluate it once on the held-out test set.

In [9]:
# Select the best model (e.g., Random Forest in this case)
best_model = RandomForestClassifier(random_state=42)
best_model.fit(X_train, y_train)

# Test set evaluation
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"\nBest Model Test Accuracy: {test_accuracy:.4f}")


Best Model Test Accuracy: 0.7906
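
The selection step and the comparison between cross-validation and test accuracy can also be done in code rather than by eye. The sketch below copies the numbers printed above into plain Python variables (the variable names are illustrative) and checks how far the test accuracy falls from the cross-validation mean.

```python
# Mean cross-validation accuracies copied from the output printed above
cv_means = {
    "Logistic Regression": 0.7443,
    "Random Forest": 0.8030,
    "SVM": 0.6286,
}

# Pick the model with the highest mean CV accuracy
best_name = max(cv_means, key=cv_means.get)
print(f"Best model by mean CV accuracy: {best_name} ({cv_means[best_name]:.4f})")

# Compare the CV estimate with the held-out test accuracy reported above
cv_std = 0.0190         # Random Forest standard deviation across folds
test_accuracy = 0.7906  # Random Forest accuracy on the test set
gap = cv_means[best_name] - test_accuracy
print(f"CV mean minus test accuracy: {gap:.4f}")
print(f"Gap within one fold standard deviation: {gap <= cv_std}")
```

A gap smaller than the fold-to-fold standard deviation suggests the cross-validation estimate was consistent with the test result, though this is a rough sanity check and not a formal statistical test.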
