**Q. What is overfitting in machine learning, and how can it be prevented?**

**Ans.**

**Overfitting definition:**

Overfitting occurs when a machine learning model fits the training data so closely that it captures noise and irregularities along with the underlying patterns. Such a model scores well on the training data but performs poorly on new, unseen data: it has effectively memorized the training set instead of learning patterns that generalize.

**Techniques to prevent overfitting:**

- **Increase training data:** More data helps the model learn general patterns.
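- **Regularization:** Adds a penalty or constraint on model complexity (for example, L1/L2 weight penalties, or depth limits and pruning for trees).
- **Early stopping:** Stops training once performance on a validation set stops improving, before the model starts memorizing noise.
- **Feature selection:** Removes irrelevant or redundant features, leaving the model fewer ways to fit noise.
- **Model simplification:** Uses simpler models with fewer parameters.
- **Ensemble methods:** Combines multiple models (for example, bagging or random forests) to improve generalization.
- **Cross-validation:** Evaluates the model on several train/validation splits. It does not reduce overfitting by itself, but it detects it and guides the choice of hyperparameters.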
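
Cross-validation makes the gap between training and validation performance measurable. The following is a minimal sketch, assuming scikit-learn's `cross_validate` with `return_train_score=True`; the 5-fold setting and the variable names are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained decision tree (assumed model for illustration)
model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation, keeping the training scores for comparison
cv_results = cross_validate(model, X, y, cv=5, return_train_score=True)

mean_train = cv_results["train_score"].mean()
mean_val = cv_results["test_score"].mean()
gap = mean_train - mean_val

print(f"Mean training accuracy:   {mean_train:.3f}")
print(f"Mean validation accuracy: {mean_val:.3f}")
print(f"Gap (train - validation): {gap:.3f}")
```

A large positive gap suggests the model is fitting noise in the training folds; a gap near zero suggests it is not overfitting on this data.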

### Load the Dataset

In [14]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score

In [15]:
# Load the dataset
data = load_iris()
X = data.data
y = data.target

In [16]:
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Build an Overfitting Model
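Train a decision tree with no depth limit or pruning. Such a tree can keep splitting until it classifies every training sample correctly, which is the setting where overfitting usually appears. Note that on this Iris split the test accuracy also comes out at 1.0, so the output below shows a perfect training fit but no train/test gap.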

In [17]:
# Decision Tree without regularization
tree_no_reg = DecisionTreeClassifier(random_state=42)
tree_no_reg.fit(X_train, y_train)

# Evaluate on training and test sets
y_train_pred = tree_no_reg.predict(X_train)
y_test_pred = tree_no_reg.predict(X_test)

print("Without Regularization:")
print(f"Training Accuracy: {accuracy_score(y_train, y_train_pred)}")
print(f"Test Accuracy: {accuracy_score(y_test, y_test_pred)}")

Without Regularization:
Training Accuracy: 1.0
Test Accuracy: 1.0
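
A train/test gap is easier to see on noisier data than Iris. As a sketch under stated assumptions, the snippet below swaps in a synthetic dataset from `make_classification` with randomly assigned labels for a fraction of samples (`flip_y=0.2`). The dataset, its sizes, and the variable names are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y=0.2 assigns
# random class labels to about 20% of the samples, injecting label noise
X_noisy, y_noisy = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=42,
)
Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noisy, y_noisy, test_size=0.3, random_state=42
)

models = {
    "unconstrained": DecisionTreeClassifier(random_state=42),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
}

# Store (train accuracy, test accuracy) for each model
scores = {}
for name, model in models.items():
    model.fit(Xn_train, yn_train)
    train_acc = accuracy_score(yn_train, model.predict(Xn_train))
    test_acc = accuracy_score(yn_test, model.predict(Xn_test))
    scores[name] = (train_acc, test_acc)
    print(f"{name:>13}: train={train_acc:.3f}  test={test_acc:.3f}")
```

The unconstrained tree fits the noisy training labels perfectly, so its test accuracy falls below its training accuracy. The depth-limited tree typically shows a much smaller gap.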


### Prevent Overfitting
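Constrain the tree with hyperparameters such as `max_depth`, `min_samples_split`, or `ccp_alpha`. In the run below, training accuracy drops to about 0.95 while test accuracy stays at 1.0: the constrained tree no longer memorizes every training sample but generalizes just as well on this split.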

In [18]:
# Decision Tree with Regularization
tree_with_reg = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
tree_with_reg.fit(X_train, y_train)

# Evaluate on training and test sets
y_train_pred_reg = tree_with_reg.predict(X_train)
y_test_pred_reg = tree_with_reg.predict(X_test)

print("\nWith Regularization:")
print(f"Training Accuracy: {accuracy_score(y_train, y_train_pred_reg)}")
print(f"Test Accuracy: {accuracy_score(y_test, y_test_pred_reg)}")


With Regularization:
Training Accuracy: 0.9523809523809523
Test Accuracy: 1.0
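
Accuracy alone does not show how much simpler the constrained tree is. A minimal sketch, assuming the same data split and hyperparameters as above, compares the two trees by depth and leaf count using `get_depth()` and `get_n_leaves()`. The variable names `unconstrained` and `constrained` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Same settings as the two trees trained in the notebook cells
unconstrained = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
constrained = DecisionTreeClassifier(
    max_depth=3, min_samples_split=10, random_state=42
).fit(X_train, y_train)

# Depth and number of leaves are simple proxies for model complexity
for name, tree in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(f"{name:>13}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

A tree with fewer leaves partitions the feature space into fewer regions, which leaves it less room to fit individual noisy samples.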


### Compare Results Using Cost Complexity Pruning

In [19]:
# Cost complexity pruning
path = tree_no_reg.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]  # Exclude the largest alpha, which prunes the tree down to a single node
trees = []

# Train trees for different values of alpha
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    trees.append(tree)

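# Evaluate each tree on the test set
# Caveat: choosing alpha by test accuracy leaks test information into model
# selection; for an unbiased estimate, pick alpha on a validation set or with
# cross-validation on the training data
test_scores = [accuracy_score(y_test, tree.predict(X_test)) for tree in trees]

# Find the best alpha; index() returns the first maximum, so ties go to the
# smallest alpha (ccp_alphas is sorted in increasing order)
best_alpha_index = test_scores.index(max(test_scores))
best_tree = trees[best_alpha_index]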

print("\nBest Tree (Cost Complexity Pruning):")
print(f"Alpha: {ccp_alphas[best_alpha_index]}")
print(f"Test Accuracy: {test_scores[best_alpha_index]}")


Best Tree (Cost Complexity Pruning):
Alpha: 0.0
Test Accuracy: 1.0
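
The selection above scores every candidate alpha on the test set, so the reported test accuracy is optimistic, and the winning alpha of 0.0 corresponds to no pruning at all. One alternative, sketched here under the assumption that `GridSearchCV` with 5-fold cross-validation is acceptable for this dataset, picks alpha on the training data only and touches the test set once at the end. The names `candidate_alphas` and `search` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate alphas from the pruning path computed on the training data only
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# Drop the largest alpha (single-node tree); clamp at 0 to guard against
# tiny negative values from floating-point round-off
candidate_alphas = [max(float(a), 0.0) for a in path.ccp_alphas[:-1]]

# 5-fold cross-validation on the training set selects alpha
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"ccp_alpha": candidate_alphas},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ccp_alpha"]
# The best model is refit on the full training set and scored once on test data
test_accuracy = search.score(X_test, y_test)

print(f"Best alpha (cross-validated): {best_alpha}")
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the test set plays no part in choosing alpha, the final test accuracy is a fairer estimate of how the pruned tree performs on unseen data.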
