### Question 1:  Best-subset regression and lasso regression
Consider the `Boston` data set in package `MASS`.  Let 'medv' be the response variable.
- 'crim'  per capita crime rate by town.
- 'zn' proportion of residential land zoned for lots over 25,000 sq.ft.
- 'indus' proportion of non-retail business acres per town.
- 'chas' Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- 'nox' nitrogen oxides concentration (parts per 10 million).
- 'rm' average number of rooms per dwelling.
- 'age' proportion of owner-occupied units built prior to 1940.
- 'dis' weighted mean of distances to five Boston employment centres.
- 'rad' index of accessibility to radial highways.
- 'tax' full-value property-tax rate per $10,000.
- 'ptratio' pupil-teacher ratio by town.
- 'black' 1000$(Bk-0.63)^2$ where Bk is the proportion of blacks by town.
- 'lstat' lower status of the population (percent).
- 'medv' median value of owner-occupied homes in $1000s.


In [1]:
import statsmodels.api as sm
Boston = sm.datasets.get_rdataset("Boston", "MASS").data
print(Boston.columns)
Boston.dropna(inplace=True)
print(Boston.shape)

Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'black', 'lstat', 'medv'],
      dtype='object')
(506, 14)


(1) Consider all regression models without interaction effects. Use 10-fold cross validation to conduct best-subset regression.  What is your best model?   If you could not install `abess` package, use forward selection or backward elimination mehtod.

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")

# Features and response
X = Boston.drop(columns='medv')
y = Boston['medv']

def adjusted_r2(model, X, y):
    r2 = model.rsquared
    n = X.shape[0]
    p = X.shape[1] - 1  # subtract intercept
    return 1 - ((1 - r2) * (n - 1)) / (n - p - 1)

def forward_selection(X, y):
    remaining = list(X.columns)
    selected = []
    current_score, best_new_score = 0.0, 0.0
    best_model = None

    while remaining:
        scores_with_candidates = []
        for candidate in remaining:
            formula = 'medv ~ ' + ' + '.join(selected + [candidate])
            model = sm.OLS.from_formula(formula, pd.concat([X, y], axis=1)).fit()
            adj_r2 = adjusted_r2(model, X[selected + [candidate]], y)
            scores_with_candidates.append((adj_r2, candidate, model))

        scores_with_candidates.sort(reverse=True)
        best_new_score, best_candidate, best_model = scores_with_candidates[0]

        if best_new_score > current_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
        else:
            break

    return selected, best_model

selected_features, best_model = forward_selection(X, y)

print("Selected features:", selected_features)
print(best_model.summary())


Selected features: ['lstat', 'rm', 'ptratio', 'dis', 'nox', 'chas', 'black', 'zn', 'crim', 'rad', 'tax']
                            OLS Regression Results                            
Dep. Variable:                   medv   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     117.3
Date:                Thu, 24 Apr 2025   Prob (F-statistic):          6.08e-136
Time:                        19:55:01   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3024.
Df Residuals:                     493   BIC:                             3079.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
--------------------------

(2) Consider all regression models without interaction effects. Use lasso regression with 10-fold cross validation to choose the tuning parameter.  What is the final model?

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
# Features and response
X = Boston.drop(columns='medv')
y = Boston['medv']

# Standardize predictors
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Lasso with 10-fold CV
lasso_cv = LassoCV(cv=10, random_state=42)
lasso_cv.fit(X_scaled, y)

# Print selected alpha and coefficients
print("Best alpha:", lasso_cv.alpha_)

# Get coefficients and corresponding feature names
coef = pd.Series(lasso_cv.coef_, index=X.columns)
selected_features = coef[coef != 0].index.tolist()

print("\nSelected features:", selected_features)
print("\nLasso coefficients:")
print(coef[coef != 0])



Best alpha: 0.14602012128965022

Selected features: ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']

Lasso coefficients:
crim      -0.496429
zn         0.537587
indus     -0.059797
chas       0.646017
nox       -1.357844
rm         2.895350
dis       -2.104315
rad        0.525318
tax       -0.284227
ptratio   -1.860406
black      0.721842
lstat     -3.720770
dtype: float64


### Question 2  In this question, we will predict the number of applications received using the other variables in the `College` data set (use the first column as index).

The data set contains a number of variables for 777 different universities and colleges in the US. The variables are
- `Private`: Public/private indicator
- `Apps`: Number of applications received
- `Accept`: Number of applicants accepted
- `Enroll`: Number of new students enrolled
- `Top10perc`: New students from top 10% of high school class
- `Top25perc`: New students from top 25% of high school class
- `F.Undergrad`: Number of full-time undergraduates
- `P.Undergrad`: Number of part-time undergraduates
- `Outstate`: Out-of-state tuition
- `Room.Board`: Room and board costs
- `Books`: Estimated book costs
- `Personal`: Estimated personal spending
- `PhD`: Percent of faculty with Ph.D.’s
- `Terminal`: Percent of faculty with terminal degree
- `S.F.Ratio`: Student/faculty ratio
- `perc.alumni`: Percent of alumni who donate
- `Expend`: Instructional expenditure per student
- `Grad.Rate`: Graduation rate

(1) Update the data by converting `Private` as a dummy variable.

In [8]:
import pandas as pd
import numpy as np

# Load the dataset
college = pd.read_csv("College.csv", index_col=0)

# Convert 'Private' to dummy variable: 1 for 'Yes', 0 for 'No'
college['Private'] = college['Private'].map({'Yes': 1, 'No': 0})

# Check the transformation
print(college['Private'].value_counts())
print(college.head())


Private
1    565
0    212
Name: count, dtype: int64
                              Private  Apps  Accept  Enroll  Top10perc  \
Abilene Christian University        1  1660    1232     721         23   
Adelphi University                  1  2186    1924     512         16   
Adrian College                      1  1428    1097     336         22   
Agnes Scott College                 1   417     349     137         60   
Alaska Pacific University           1   193     146      55         16   

                              Top25perc  F.Undergrad  P.Undergrad  Outstate  \
Abilene Christian University         52         2885          537      7440   
Adelphi University                   29         2683         1227     12280   
Adrian College                       50         1036           99     11250   
Agnes Scott College                  89          510           63     12960   
Alaska Pacific University            44          249          869      7560   

                            

(2) The following Python code split the data set into a training set and a test set after the original variable `Private` is dropped.

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
seed=100
X = college.drop("Apps", axis=1)  
y = college["Apps"]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=seed, shuffle=True)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
#transform() applies precomputed scaling parameters on the training data 

(3) Fit a lasso model on the training set, with $\lambda$ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

In [20]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
import numpy as np

# Fit Lasso model with 10-fold CV
lasso = LassoCV(cv=10, random_state=100)
lasso.fit(X_train, y_train)

# Predict on test set
y_pred = lasso.predict(X_test)

# Compute test error (MSE)
test_mse = mean_squared_error(y_test, y_pred)

# Number of non-zero coefficients
nonzero_coefs = np.sum(lasso.coef_ != 0)

# Report results
print(f"Best alpha (lambda): {lasso.alpha_:.4f}")
print(f"Test MSE: {test_mse:.2f}")
print(f"Number of non-zero coefficients: {nonzero_coefs}")


Best alpha (lambda): 13.7486
Test MSE: 895794.84
Number of non-zero coefficients: 14


(4) Fit a PCR model on the training set, with $M$ chosen by crossvalidation.
Report the test error obtained, along with the value of $M$ selected by cross-validation.

In [19]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Fit a PCR model with cross-validation to choose M
best_m = 0
lowest_mse = np.inf
mse_scores = []

for m in range(1, X_train.shape[1] + 1):
    # Create a pipeline with PCA and Linear Regression
    pcr = make_pipeline(PCA(n_components=m), LinearRegression())
    
    # Compute cross-validation MSE
    pcr_mse = cross_val_score(pcr, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    mse_scores.append(np.mean(-pcr_mse))  # MSE is negative, so we negate to get positive values
    
    # Track the best number of components based on MSE
    if np.mean(-pcr_mse) < lowest_mse:
        lowest_mse = np.mean(-pcr_mse)
        best_m = m

# Fit PCR model with the best M
pcr_best = make_pipeline(PCA(n_components=best_m), LinearRegression())
pcr_best.fit(X_train, y_train)

# Predict on the test set
y_pred = pcr_best.predict(X_test)

# Compute test error (MSE)
test_mse = mean_squared_error(y_test, y_pred)

# Report results
print(f"Best number of components (M): {best_m}")
print(f"Test MSE: {test_mse:.2f}")


Best number of components (M): 17
Test MSE: 913034.36


(5) Fit a PLS model on the training set, with $M$ chosen by crossvalidation.
Report the test error obtained, along with the value of $M$ selected by cross-validation.

In [18]:
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# Fit a PLS model with cross-validation to choose M
best_m = 0
lowest_mse = np.inf

for m in range(1, X_train.shape[1] + 1):
    # Create a PLS model with the specified number of components
    pls = PLSRegression(n_components=m)
    
    # Compute cross-validation MSE (note we don't need LinearRegression in the pipeline)
    pls_mse = cross_val_score(pls, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    
    # Track the best number of components based on MSE
    mean_mse = np.mean(-pls_mse)  # Negate since scoring returns negative MSE
    if mean_mse < lowest_mse:
        lowest_mse = mean_mse
        best_m = m

# Fit PLS model with the best M
pls_best = PLSRegression(n_components=best_m)
pls_best.fit(X_train, y_train)

# Predict on the test set
y_pred = pls_best.predict(X_test)

# Compute test error (MSE)
test_mse = mean_squared_error(y_test, y_pred)

# Report results
print(f"Best number of components (M): {best_m}")
print(f"Test MSE: {test_mse:.2f}")




Best number of components (M): 16
Test MSE: 913025.02


(6) Comment on the results obtained. How accurately can we predict
the number of college applications received? Is there much
difference among the test errors resulting from these four approaches?

**Ans**:

The Lasso model had the lowest test MSE, so the lasso model was the best at predicting the number of college applications. However, there was not an immense difference between the models used