<a href="https://colab.research.google.com/github/umair594/100-Prediction-Models-/blob/main/Stepwise_Regression_%E2%80%93_Feature_Selection_Through_Iterative_Inclusion_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project 10: Stepwise Regression – Feature Selection Through Iterative Inclusion**

**Abstract**

Stepwise Regression is a systematic method for selecting important features in a regression model. It iteratively adds or removes predictors based on statistical criteria such as the p-value or AIC/BIC. This project explores the theory, implementation, and evaluation of Stepwise Regression using Python. The goal is to improve model interpretability, reduce overfitting, and identify the most significant predictors. Implementation on a synthetic dataset demonstrates how stepwise selection helps construct a parsimonious and effective regression model.

**Introduction**

In regression modeling, including all available features can lead to:

Overfitting

Multicollinearity

Poor interpretability

Stepwise Regression addresses this by iteratively including or excluding features based on their contribution to model performance. There are three main approaches:

Forward Selection – Start with no predictors and add features one at a time based on improvement in a criterion (e.g., p-value, AIC).

Backward Elimination – Start with all predictors and iteratively remove the least significant feature.

Bidirectional / Stepwise Selection – A combination of the forward and backward methods.

This project demonstrates stepwise regression (forward selection) for feature selection in linear regression.

**Theoretical Background**

Linear Regression

Linear regression predicts a continuous response variable $y$ as a linear combination of predictors $X_1, X_2, \ldots, X_n$:

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon$$

$\beta_i$ = regression coefficients

$\epsilon$ = error term

**Stepwise Regression**

Stepwise regression iteratively evaluates feature significance:

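Forward Selection: At each step, add the candidate feature with the smallest p-value, provided it is below the threshold (commonly 0.05)

Backward Elimination: At each step, remove the feature with the largest p-value, provided it is above the threshold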

Stepwise: Combination of both

It reduces dimensionality while maintaining predictive power.

**Criteria for Feature Inclusion/Exclusion**

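p-value: Probability of observing a coefficient estimate at least as extreme as the one obtained if the true coefficient were zero; a small p-value indicates statistical significance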

AIC / BIC: Penalized likelihood criteria to balance model fit and complexity

**Methodology**

Steps in this project:

Generate a synthetic regression dataset with multiple predictors.

Split the dataset into training and testing sets.

Implement forward stepwise regression using p-values.

Fit the selected features using OLS regression.

Evaluate model performance using R² and RMSE.

Analyze selected features and their coefficients.

# **Python Implementation**

**Import Libraries**

In [59]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error, r2_score

**Generate Dataset**

In [60]:
# Synthetic regression dataset
X, y = make_regression(
    n_samples=200,
    n_features=10,
    n_informative=5,
    noise=10,
    random_state=42
)

# Convert to DataFrame for easier handling
X = pd.DataFrame(X, columns=[f'X{i}' for i in range(1, 11)])
y = pd.Series(y, name='y')

**Train-Test Split**

In [61]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

**Stepwise Regression Function (Forward Selection)**

In [62]:
def forward_selection(X, y, significance_level=0.05):
    """Greedy forward selection: add the most significant feature until none passes the threshold."""
    remaining_features = list(X.columns)
    selected_features = []

    while remaining_features:
        # p-value of each candidate when added to the features selected so far
        p_values = pd.Series(index=remaining_features, dtype=float)
        for feature in remaining_features:
            candidate = sm.add_constant(X[selected_features + [feature]])
            model = sm.OLS(y, candidate).fit()
            p_values[feature] = model.pvalues[feature]

        if p_values.min() < significance_level:
            # Keep the most significant candidate and continue searching
            best_feature = p_values.idxmin()
            selected_features.append(best_feature)
            remaining_features.remove(best_feature)
        else:
            # No remaining candidate is significant: stop
            break
    return selected_features

**Apply Forward Selection**

In [63]:
selected_features = forward_selection(X_train, y_train)
print("Selected features:", selected_features)

Selected features: ['X4', 'X8', 'X6', 'X3', 'X7']
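Because the data are synthetic, the selection can be checked against the ground truth. The sketch below is an addition to the original workflow: it calls `make_regression` with the same arguments as the dataset cell plus `coef=True`, which also returns the true coefficients. It assumes that this flag only changes the return value, so the same `random_state` should reproduce the same data; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same arguments as the dataset cell, plus coef=True to return the true coefficients.
_, _, true_coef = make_regression(
    n_samples=200,
    n_features=10,
    n_informative=5,
    noise=10,
    random_state=42,
    coef=True,
)

feature_names = [f"X{i}" for i in range(1, 11)]
# Features with a non-zero true coefficient are the ones that actually generate y.
truly_informative = [name for name, c in zip(feature_names, true_coef) if c != 0]

selected = ["X4", "X8", "X6", "X3", "X7"]  # copied from the output above
print("Truly informative:", sorted(truly_informative))
print("Selected:         ", sorted(selected))
print("Exact match:", set(selected) == set(truly_informative))
```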


**Fit Final Model with Selected Features**

In [64]:
X_train_selected = sm.add_constant(X_train[selected_features])
X_test_selected = sm.add_constant(X_test[selected_features])

final_model = sm.OLS(y_train, X_train_selected).fit()
print(final_model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.991
Method:                 Least Squares   F-statistic:                     3341.
Date:                Sat, 07 Feb 2026   Prob (F-statistic):          4.81e-155
Time:                        06:34:29   Log-Likelihood:                -586.10
No. Observations:                 160   AIC:                             1184.
Df Residuals:                     154   BIC:                             1203.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.0306      0.769     -1.341      0.1

**Predictions and Evaluation**

In [65]:
y_pred = final_model.predict(X_test_selected)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("R²:", r2)
print("RMSE:", rmse)

R²: 0.987197260033537
RMSE: 8.782209520477359


**Results and Discussion**

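Selected Features: Forward selection retained five of the ten predictors (X4, X8, X6, X3, X7), each entering with a p-value below 0.05. This matches the number of informative features used to generate the data (n_informative=5).

Model Fit: R² is the proportion of variance explained; the final model reaches 0.991 on the training set and about 0.987 on the test set.

Error Metric: RMSE measures prediction error in the units of y; the test RMSE of about 8.78 is comparable to the noise level (noise=10) used to generate the data.

Stepwise regression reduced the model from ten predictors to five while retaining a test R² of about 0.987.

Coefficients indicate the strength and direction of each selected predictor's association with y.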

**Advantages**

Can reduce overfitting by excluding irrelevant features.

Improves interpretability of the model.

Automates feature selection in a systematic way.

**Limitations**

Stepwise regression is greedy: it commits to one feature at a time and may miss the optimal feature combination.

Sensitive to p-value threshold.

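May not perform well with highly correlated features, because which of two correlated predictors enters the model can depend on small changes in the data.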

**Conclusion**

Stepwise regression is an effective approach for feature selection in regression modeling. By iteratively adding or removing predictors based on statistical significance, the method produces a parsimonious model that balances interpretability and predictive power. In this project, stepwise regression was implemented in Python and successfully identified the most significant features while maintaining strong model performance.