# Exercise 1

Apply each family of feature selection methods (filter, wrapper, and embedded) to find the best features.

## Dataset Information

## Features

Number of Instances: 20640

Number of Attributes: 8 numeric, predictive attributes and the target

Attribute Information:

MedInc - median income in block group

HouseAge - median house age in block group

AveRooms - average number of rooms per household

AveBedrms - average number of bedrooms per household

Population - block group population

AveOccup - average number of household members

Latitude - block group latitude

Longitude - block group longitude

## Target
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

In [18]:
import pandas as pd
import numpy as np
# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For model building
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

In [2]:
housing = fetch_california_housing(as_frame=True)
df = pd.concat([housing.data, housing.target], axis=1)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [4]:
df.head()

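|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |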


In [5]:
df.columns =[
    'MedInc',
    'HouseAge',
    'AveRooms',
    'AveBedrms',
    'Population',
    'AveOccup',
    'Latitude',
    'Longitude',
    'MedHouseVal'
]

x = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

1. Use any filter method to select the best features

In [6]:
filter_selector = SelectKBest(score_func=f_regression, k=5)
X_new = filter_selector.fit_transform(x, y)

filter_selected_features = x.columns[filter_selector.get_support()]
print("Select features using SelectKBest:", filter_selected_features)

Select features using SelectKBest: Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Latitude'], dtype='object')
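To see why a filter method keeps some columns and drops others, it can help to look at the per-feature scores rather than only the final mask. The following is a minimal sketch on synthetic stand-in data (the column names `a`, `b`, `c` and the coefficients are assumptions for illustration, not the housing data); it reads the F-statistics from the fitted selector's `scores_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (an assumption, not the housing dataset):
# y depends strongly on "a", moderately on "b", and not at all on "c".
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
y_demo = 3.0 * X_demo["a"] + 1.0 * X_demo["b"] + rng.normal(scale=0.5, size=500)

selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# scores_ holds one F-statistic per column; a higher value means a
# stronger univariate linear association with the target.
scores = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
kept = list(X_demo.columns[selector.get_support()])

print(scores)
print("Kept:", kept)
```

Because `f_regression` scores each column on its own, it only measures linear, one-at-a-time association; a feature that matters only in combination with others can still receive a low score.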

2. Use any wrapper method to select the best features

In [7]:
model = LinearRegression()
rfe_selector = RFE(model, n_features_to_select=5, step=1)
X_new_wrapper = rfe_selector.fit_transform(x, y)

wrapper_selected_features = x.columns[rfe_selector.support_]
print("Select features using RFE:", wrapper_selected_features)

Select features using RFE: Index(['MedInc', 'AveRooms', 'AveBedrms', 'Latitude', 'Longitude'], dtype='object')
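RFE with `LinearRegression` ranks features by the size of their coefficients, and coefficient size depends on the units of each column. The sketch below uses synthetic stand-in data (the column names and coefficients are hypothetical) to suggest how standardizing first can change which feature survives; whether this matters for the housing columns would need to be checked separately.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data (an assumption): "big_scale" explains most of y,
# but its large numeric range makes its raw coefficient tiny.
X_demo = pd.DataFrame({
    "big_scale": rng.normal(scale=1000.0, size=n),
    "small_scale": rng.normal(scale=1.0, size=n),
    "noise": rng.normal(scale=1.0, size=n),
})
y_demo = (0.003 * X_demo["big_scale"]
          + 1.0 * X_demo["small_scale"]
          + rng.normal(scale=0.5, size=n))

def rfe_pick(X, y):
    """Return the single column that RFE keeps for a linear model."""
    rfe = RFE(LinearRegression(), n_features_to_select=1).fit(X, y)
    return list(X.columns[rfe.support_])

# Standardize so every column has unit variance before ranking.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X_demo), columns=X_demo.columns)

raw_pick = rfe_pick(X_demo, y_demo)
scaled_pick = rfe_pick(X_scaled, y_demo)

print("RFE on raw features:   ", raw_pick)
print("RFE on scaled features:", scaled_pick)
```

On unscaled data the ranking partly reflects units rather than predictive value, so a large-range column such as a raw population count may be eliminated early for that reason alone.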


3. Use any embedded method to select the best features

In [23]:
# The target is continuous, so use DecisionTreeRegressor rather than DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# No random_state is set, so the importances can vary slightly between runs
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)

feature_importances = tree.feature_importances_
sel_tree_index = feature_importances > 0.1

df_tree = X_train.iloc[:, sel_tree_index]
selected_features_embedded = df_tree.columns.tolist()
print("Select features using Decision Tree:", selected_features_embedded)

Select features using Decision Tree: ['MedInc', 'AveOccup']
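The manual thresholding of `feature_importances_` above can also be expressed with scikit-learn's `SelectFromModel`. This is a minimal sketch on synthetic stand-in data (the column names, coefficients, and `random_state=0` are assumptions), using the same 0.1 importance threshold.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): two informative columns, two junk columns.
X_demo = pd.DataFrame(
    rng.uniform(size=(600, 4)),
    columns=["strong", "medium", "junk1", "junk2"],
)
y_demo = (5.0 * X_demo["strong"]
          + 3.0 * X_demo["medium"]
          + rng.normal(scale=0.1, size=600))

# SelectFromModel fits the tree, then keeps columns whose importance
# is at or above the threshold.
selector = SelectFromModel(DecisionTreeRegressor(random_state=0), threshold=0.1)
selector.fit(X_demo, y_demo)

importances = pd.Series(selector.estimator_.feature_importances_, index=X_demo.columns)
selected = list(X_demo.columns[selector.get_support()])

print(importances.round(3))
print("Selected:", selected)
```

Tree importances sum to 1, so a fixed cutoff such as 0.1 becomes stricter as the number of features grows; a relative threshold such as `threshold="mean"` is one alternative.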


In [27]:
def compute_rmse(model, X_train, y_train, X_test, y_test):
    """Fit the model on the training split and return its RMSE on the test split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    return rmse

default_rmse = compute_rmse(LinearRegression(), X_train, y_train, X_test, y_test)

X_train_filter = X_train[filter_selected_features]
X_test_filter = X_test[filter_selected_features]
filter_rmse = compute_rmse(LinearRegression(), X_train_filter, y_train, X_test_filter, y_test)

X_train_wrapper = X_train[wrapper_selected_features]
X_test_wrapper = X_test[wrapper_selected_features]
wrapper_rmse = compute_rmse(LinearRegression(), X_train_wrapper, y_train, X_test_wrapper, y_test)

X_train_embedded = X_train[selected_features_embedded]
X_test_embedded = X_test[selected_features_embedded]
embedded_rmse = compute_rmse(LinearRegression(), X_train_embedded, y_train, X_test_embedded, y_test)

print("Model RMSE:")
print("Default RMSE:", default_rmse)
print("Filter RMSE:", filter_rmse)
print("Wrapper RMSE:", wrapper_rmse)
print("Embedded RMSE:", embedded_rmse)

Model RMSE:
Default RMSE: 0.7455813830127764
Filter RMSE: 0.7989095969855363
Wrapper RMSE: 0.7528409640011294
Embedded RMSE: 0.8412318369691884
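The comparison above rests on a single train/test split, and the filter and wrapper selectors were fitted on the full dataset before splitting. One way to make such a comparison more robust is to put the selector inside a pipeline and cross-validate, so selection is refitted on each training fold. The sketch below illustrates the idea on synthetic data from `make_regression` (an assumption; the fold count and `k=3` are arbitrary choices, not values from this exercise).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, n_informative=3, noise=10.0, random_state=0
)

candidates = {
    "all features": make_pipeline(LinearRegression()),
    "top 3 (filter)": make_pipeline(SelectKBest(f_regression, k=3), LinearRegression()),
}

cv_rmse = {}
for name, pipe in candidates.items():
    # The scorer returns negative RMSE, so flip the sign before averaging.
    scores = cross_val_score(
        pipe, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()
    print(f"{name}: mean CV RMSE = {cv_rmse[name]:.3f}")
```

The same pattern should extend to `RFE` or `SelectFromModel` by swapping the first pipeline step.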
