<a href="https://colab.research.google.com/github/robitussin/CCADMACL_EXERCISES/blob/main/Exercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 1

Apply each type of feature selection method (filter, wrapper, and embedded) to find the best features.

## Dataset Information

## Features

Number of Instances: 20640

Number of Attributes: 8 numeric, predictive attributes and the target

Attribute Information:

MedInc - median income in block group

HouseAge - median house age in block group

AveRooms - average number of rooms per household

AveBedrms - average number of bedrooms per household

Population - block group population

AveOccup - average number of household members

Latitude - block group latitude

Longitude - block group longitude

## Target
The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

In [64]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split




In [46]:
housing = fetch_california_housing(as_frame=True)
df = pd.concat([housing.data, housing.target], axis=1)
df_target = housing.target
df_features = housing.data

1. Use any filter method to select the best features

In [96]:
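# Filter method: keep features whose variance exceeds a fixed threshold.
# Note: variance depends on each feature's units, and the target is never consulted.
from sklearn.feature_selection import VarianceThreshold
threshold = 5
sl = VarianceThreshold(threshold=threshold)
s = sl.fit(df_features)  # fit on the features only, not on the target column

res = df_features.iloc[:, s.get_support()]
print("Filtered Method")
print("Filtered: ",res.columns)
print()
print("Original: ",df.columns)

# Evaluate the selected features with a decision tree on a held-out split.
tree1 = DecisionTreeRegressor(random_state=0)

X = df[res.columns]
y = housing.target
X_train, X_test1, y_train, y_test1 = train_test_split(X, y,  random_state=1)

tree1.fit(X_train, y_train)
y_pred1 = tree1.predict(X_test1)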



Filtered Method
Filtered:  Index(['HouseAge', 'AveRooms', 'Population', 'AveOccup'], dtype='object')

Original:  Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'MedHouseVal'],
      dtype='object')
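
A variance threshold looks only at how spread out each column is, so the result depends on the units of each feature and ignores the target. That is consistent with the output above, where MedInc is missing from the filtered list. The following is a minimal sketch on synthetic data (not the housing data) that contrasts `VarianceThreshold` with a univariate filter that does use the target, `SelectKBest` with `f_regression`. The column names `signal` and `noise` and the toy target are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression

rng = np.random.default_rng(0)
n = 500
# Hypothetical toy data: `signal` drives the target but has a small variance,
# `noise` is unrelated to the target but has a large variance.
X_toy = pd.DataFrame({
    "signal": rng.normal(0.0, 1.0, n),
    "noise": rng.normal(0.0, 10.0, n),
})
y_toy = 3.0 * X_toy["signal"] + rng.normal(0.0, 0.1, n)

# Variance filter: keeps columns whose variance exceeds the threshold.
vt = VarianceThreshold(threshold=5).fit(X_toy)
kept_by_variance = list(X_toy.columns[vt.get_support()])

# Univariate filter: scores each column against the target with an F-test.
kbest = SelectKBest(score_func=f_regression, k=1).fit(X_toy, y_toy)
kept_by_ftest = list(X_toy.columns[kbest.get_support()])

print("VarianceThreshold keeps:", kept_by_variance)
print("SelectKBest keeps:", kept_by_ftest)
```

In this toy setup the variance filter keeps the high-variance but uninformative column, while the F-test filter keeps the informative one. Standardizing the features before a variance filter would not help, since every standardized column has the same variance.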


2. Use any wrapper method to select the best features

In [98]:
# Wrapper method: recursive feature elimination (RFE) around a random forest.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

threshold = 5  # number of features to keep
rfr = RandomForestRegressor(n_estimators=200, random_state=0, max_depth=3)
sl = RFE(rfr, n_features_to_select=threshold, step=1)

sl = sl.fit(df_features, df_target)
index = sl.get_support()
res = df_features.iloc[:, index]
print("Wrapper Method")
print("Wrapper: ",res.columns)
print()
print("Original: ",df.columns)

tree2 = DecisionTreeRegressor(random_state=0)

X = df[res.columns]
y = housing.target
X_train, X_test2, y_train, y_test2 = train_test_split(X, y,  random_state=1)

tree2.fit(X_train, y_train)
y_pred2 = tree2.predict(X_test2)


Wrapper Method
Wrapper:  Index(['MedInc', 'HouseAge', 'AveRooms', 'AveOccup', 'Latitude'], dtype='object')

Original:  Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'MedHouseVal'],
      dtype='object')
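
RFE fits the estimator, drops the weakest feature (one per round when `step=1`), and repeats until `n_features_to_select` remain. The fitted selector also exposes `ranking_`, which records the elimination order. The sketch below illustrates this on synthetic data. It swaps in `LinearRegression` for speed rather than the random forest used above, and the toy target is an assumption for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
# Only columns 0 and 2 influence the hypothetical target.
y_toy = 5.0 * X_toy[:, 0] - 4.0 * X_toy[:, 2] + rng.normal(0.0, 0.1, 300)

rfe = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X_toy, y_toy)

print("support:", rfe.support_)   # boolean mask of the kept columns
print("ranking:", rfe.ranking_)   # 1 = kept; larger = eliminated earlier
```

Inspecting `ranking_` on the housing data would show which of the three discarded features came closest to being kept.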


3. Use any embedded method to select the best features

In [106]:
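# Embedded method: select features from a fitted random forest's importances.
from sklearn.feature_selection import SelectFromModel
# Split the full feature set explicitly. Reusing X from an earlier cell would
# pass only previously selected columns and misalign the mask applied below.
X_train, X_test3, y_train, y_tes3 = train_test_split(df_features, df_target, random_state=1)
rfr = RandomForestRegressor(n_estimators=500, random_state=0, max_depth=3)

mrf = rfr.fit(X_train, y_train)

# With no explicit threshold, features with importance >= the mean are kept.
sfm = SelectFromModel(mrf, prefit=True)
index = sfm.get_support()

res = df_features.iloc[:, index]

X = df[res.columns]
y = housing.target
print("Embedded Method")
print("Embedded: ",res.columns)
print()
print("Original: ",df.columns)
# Note: these predictions come from the random forest trained on all features,
# not from a decision tree trained on the selected features.
y_pred3 = mrf.predict(X_test3)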



Embedded Method
Embedded:  Index(['MedInc', 'AveOccup'], dtype='object')

Original:  Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
       'Latitude', 'Longitude', 'MedHouseVal'],
      dtype='object')
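
Only two features survive here because `SelectFromModel`, when no `threshold` is given and the estimator is not L1-penalized, keeps the features whose importance is at least the mean importance. The sketch below checks that rule on synthetic data. The toy data and the smaller forest are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
# Only column 0 influences the hypothetical target.
y_toy = 5.0 * X_toy[:, 0] + rng.normal(0.0, 0.1, 300)

forest = RandomForestRegressor(n_estimators=50, max_depth=3, random_state=0)
sfm = SelectFromModel(forest).fit(X_toy, y_toy)  # threshold left at its default

importances = sfm.estimator_.feature_importances_
mask = sfm.get_support()

print("importances:", importances.round(3))
print("threshold (mean importance):", sfm.threshold_)
print("selected columns:", np.flatnonzero(mask))
```

Passing an explicit `threshold` (for example `"median"`) or `max_features` would make the selection less aggressive.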


In [107]:
from sklearn.metrics import root_mean_squared_error as rmse

treed = DecisionTreeRegressor(random_state=0)
X = housing.data
y = housing.target
X_train, X_testd, y_train, y_testd = train_test_split(X, y,  random_state=1)
treed.fit(X_train, y_train)
y_predd = treed.predict(X_testd)
print("RMSE: ")
print("Default Method: ", rmse(y_testd, y_predd))
print("Filtered Method Variance Threshold: ",rmse(y_test1, y_pred1))
print("Wrapper Method: ", rmse(y_test2, y_pred2))
print("Embedded Method: ",rmse(y_tes3, y_pred3))

RMSE: 
Default Method:  0.7388441108252426
Filtered Method Variance Threshold:  1.3088702515934785
Wrapper Method:  0.832817543080276
Embedded Method:  0.765456706533764
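
These numbers come from a single train/test split, and the embedded score is produced by the random forest rather than by a decision tree, so the comparison is only rough. One way to compare feature subsets on equal footing is to score the same model on each subset with the same cross-validation folds. The following is a sketch on synthetic data. The helper `cv_rmse`, the column names, and the toy target are hypothetical and not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor


def cv_rmse(X, y, columns, n_splits=5, seed=0):
    """Mean cross-validated RMSE of a decision tree on the given columns."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(
        DecisionTreeRegressor(random_state=seed),
        X[columns],
        y,
        scoring="neg_root_mean_squared_error",  # scikit-learn returns negated RMSE
        cv=cv,
    )
    return -scores.mean()


rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(400, 3)), columns=["a", "b", "c"])
# Only column "a" influences the hypothetical target.
y_toy = 2.0 * X_toy["a"] + rng.normal(0.0, 0.1, 400)

subsets = {"all": ["a", "b", "c"], "informative": ["a"], "uninformative": ["b", "c"]}
results = {name: cv_rmse(X_toy, y_toy, cols) for name, cols in subsets.items()}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

On the housing data, the same helper could be called with `df_features`, `df_target`, and each selected column list in turn.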
