## Introduction
Welcome back! In this lesson, we will delve into another powerful technique for feature selection — SelectFromModel. This technique is particularly useful when you have a trained model and want to select the most important features based on the model's importance criterion.

SelectFromModel is a meta-transformer that works with any estimator that exposes a per-feature importance after fitting, through an attribute such as coef_ or feature_importances_. For coef_, the importance of a feature is the absolute value of its coefficient. You can set a threshold, and SelectFromModel keeps the features whose importance is greater than or equal to it.
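
To make the threshold rule concrete, here is a minimal sketch on synthetic data. The data, the `_demo` names, and the chosen coefficients are illustrative assumptions and are not part of the lesson's dataset. It builds the selection mask by hand and compares it with what SelectFromModel reports.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): the target depends strongly
# on columns 0 and 2, weakly on column 1, and not at all on column 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1] - 2.0 * X_demo[:, 2]

selector = SelectFromModel(LinearRegression()).fit(X_demo, y_demo)

# For a linear model, importance is the absolute value of each coefficient.
importances = np.abs(selector.estimator_.coef_)

# A feature is kept when its importance is at or above the threshold
# (the mean importance, since no threshold was passed).
manual_mask = importances >= selector.threshold_

print(importances.round(2))
print(round(float(selector.threshold_), 3))
print(manual_mask)
print(selector.get_support())
```

Both masks should agree, with only columns 0 and 2 selected, because their coefficient magnitudes (about 3 and 2) are the only ones above the mean of roughly 1.275.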

In essence, SelectFromModel does the heavy lifting of ranking features by the model's importance criterion and keeping only those that pass the threshold, which is a significant advantage for any machine learning practitioner!

## Exploring the California Housing Dataset
Dimensionality reduction and feature selection are easier to understand when we apply them to a real dataset. For this lesson, we will work with the California Housing dataset, which is available through scikit-learn's datasets module.

Let's begin by loading and briefly exploring the dataset:

In [3]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing

# Load California housing dataset
housing = fetch_california_housing()
X = housing.data
Y = housing.target

The California Housing dataset has eight numeric features describing census block groups in California, such as median income, average number of rooms, and population. Each of these features may influence the target variable, the median house value, to a different degree, which makes the dataset a good candidate for feature selection.

After loading the dataset, we would normally split the data into training and testing sets using the train_test_split function from scikit-learn, so that we can train our model on the training data and evaluate it on the testing data. We will skip this step in this lesson and focus only on applying SelectFromModel for feature selection.
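
For reference, the skipped step could look like the sketch below. It uses a synthetic stand-in from `make_regression` so that it runs without downloading anything. The `_demo` names, the sample count, and the `test_size` value are assumptions chosen for illustration. The point it shows is that the selector is fitted on the training rows only and then applied to both splits.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption for this sketch).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=1.0, random_state=0
)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Learn which features to keep from the training rows only...
selector_demo = SelectFromModel(LinearRegression()).fit(X_train_demo, y_train_demo)

# ...then apply the same column selection to both splits.
X_train_sel = selector_demo.transform(X_train_demo)
X_test_sel = selector_demo.transform(X_test_demo)

print(X_train_demo.shape, X_test_demo.shape)
print(X_train_sel.shape, X_test_sel.shape)
```

Fitting the selector on the training split alone keeps information from the test rows out of the feature selection step.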

## Creating and Training a Linear Regression Model
Before moving on to feature selection, let's understand why we need a model to perform feature selection. The model helps us determine the importance of each feature in predicting the target variable. This importance is then used by SelectFromModel to select the most important features. Hence, the model acts as a guiding light in our feature selection journey.

For demonstration purposes, we'll use a simple Linear Regression model, which learns a linear relationship between the independent variables (features) and the dependent variable (housing prices).

Let's fit our model:

In [5]:
from sklearn.linear_model import LinearRegression

# Fitting linear regression model on the data
lr = LinearRegression()
lr.fit(X, Y)

Great! Now that we have a trained Linear Regression model, let's use it with SelectFromModel to perform feature selection.

## Performing Feature Selection using `SelectFromModel`
Let's pass our Linear Regression model to SelectFromModel, which uses the absolute values of the model's coefficients as feature importances. Features whose importance is at or above the threshold are kept. Since we don't set a threshold here, SelectFromModel falls back to its default for this model, the mean of the importances.

Here's how we do it:



In [7]:
from sklearn.feature_selection import SelectFromModel

# Applying SelectFromModel
sfm = SelectFromModel(lr)
sfm.fit(X, Y)

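With these steps, we instructed SelectFromModel to determine the importance of every feature in our dataset through a Linear Regression model and to select the features whose importance passes the threshold. Note that sfm.fit(X, Y) fits its own copy of the estimator we passed in; the fitted copy is available as sfm.estimator_.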
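
The sketch below illustrates two things a fitted selector gives you, using synthetic data in place of the housing data. The `_demo` and `base_model` names and the coefficients are assumptions made for illustration. It checks that fitting trains a clone of the estimator, and it uses transform to obtain the reduced feature matrix.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): only columns 1 and 4 drive the target.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 5))
y_demo = 4.0 * X_demo[:, 1] - 3.0 * X_demo[:, 4]

base_model = LinearRegression()
sfm_demo = SelectFromModel(base_model).fit(X_demo, y_demo)

# fit() trained a clone, so the object we passed in is still unfitted.
print(hasattr(base_model, "coef_"))           # False
print(hasattr(sfm_demo.estimator_, "coef_"))  # True

# transform() keeps only the selected columns.
X_reduced = sfm_demo.transform(X_demo)
print(X_demo.shape, "->", X_reduced.shape)    # (150, 5) -> (150, 2)
```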

## Interpreting and Validating Selected Features
Finally, we arrive at the step where we uncover the most important features in our dataset according to SelectFromModel. We do this by calling sfm.get_support(indices=True), which returns an array with indices of features that are important. These indices are used to get the corresponding feature names from the dataset.

Let's unveil our most important features:

In [8]:
# Printing the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(housing.feature_names[feature_list_index])

MedInc
AveBedrms
Latitude
Longitude


This output shows that, of the eight features in the California Housing dataset, Median Income (MedInc), Average Bedrooms (AveBedrms), Latitude, and Longitude have coefficient magnitudes at or above the mean, so SelectFromModel keeps them as the most important features for predicting housing prices. This filtering allows for a more focused analysis and model training with these key features.
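
One caveat when reading a coefficient-based selection: a coefficient's magnitude depends on the units of its feature, so a feature measured on a large numeric scale can look unimportant. The sketch below is a hedged illustration on synthetic data. The `_demo` names, the factor of 1000, and the use of `StandardScaler` are assumptions for this example and are not steps taken in the lesson.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): columns 0 and 1 matter equally,
# column 2 barely matters.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + 0.1 * X_demo[:, 2]

# Record column 1 in units that make its values 1000 times larger.
# Its coefficient shrinks by the same factor.
X_units = X_demo.copy()
X_units[:, 1] *= 1000.0

raw_support = SelectFromModel(LinearRegression()).fit(X_units, y_demo).get_support()

# Standardizing puts every column on a comparable scale before selection.
X_std = StandardScaler().fit_transform(X_units)
std_support = SelectFromModel(LinearRegression()).fit(X_std, y_demo).get_support()

print(raw_support)  # column 1 is dropped on the raw units
print(std_support)  # column 1 is kept after standardizing
```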

Congratulations, you now have a list of the most important features, which you can use to train a more efficient and accurate model. Reducing your data's dimensionality using SelectFromModel doesn't look challenging now, does it?
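
As one possible way to use the selected features for training, the sketch below chains SelectFromModel and a final Linear Regression in a scikit-learn pipeline. The synthetic data and the `_demo`, `X_tr`, and `X_te` names are assumptions for illustration, so the score it prints says nothing about the housing data.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data (illustrative assumption): columns 0 and 3 carry the signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 5.0 * X_demo[:, 0] - 4.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Step 1 selects features; step 2 trains a model on the selected columns only.
pipeline_demo = make_pipeline(SelectFromModel(LinearRegression()), LinearRegression())
pipeline_demo.fit(X_tr, y_tr)

kept = pipeline_demo[0].get_support(indices=True)
print(kept)                                       # indices of the kept columns
print(round(pipeline_demo.score(X_te, y_te), 3))  # R^2 on the held-out rows
```

Wrapping both steps in a pipeline means the same column selection is applied automatically whenever the model predicts on new data.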

## Applying Custom Thresholds
The default threshold for SelectFromModel here is the mean of the feature importances, but you can set your own using the threshold parameter. For instance, to keep only the features whose importance is at least 0.6, you can do the following:



In [9]:
sfm = SelectFromModel(lr, threshold=0.6)
sfm.fit(X, Y)

for feature_list_index in sfm.get_support(indices=True):
    print(housing.feature_names[feature_list_index])

AveBedrms
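
In the run above, only AveBedrms passes because its coefficient magnitude is the only one at or above 0.6. Besides a number, the threshold parameter also accepts the strings "mean" and "median", optionally with a scaling factor such as "0.1*mean", and the max_features parameter caps how many features are kept. The sketch below compares these options on synthetic data. The coefficients, the `_demo` names, and the `settings` dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption) with known coefficient sizes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
true_coefs = np.array([6.0, 3.0, 1.5, 0.5, 0.2, 0.0])
y_demo = X_demo @ true_coefs

settings = {
    "mean (default)": dict(),
    "median": dict(threshold="median"),
    "0.1*mean": dict(threshold="0.1*mean"),
    # Disable the threshold and keep only the single largest importance.
    "top 1 only": dict(threshold=-np.inf, max_features=1),
}

selected = {}
for label, kwargs in settings.items():
    selector = SelectFromModel(LinearRegression(), **kwargs).fit(X_demo, y_demo)
    selected[label] = selector.get_support(indices=True).tolist()
    print(label, selected[label])
```

A lower threshold keeps more features and a higher one keeps fewer, so it is worth checking how a downstream model performs with each choice.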


## Lesson Summary
In today's lesson, we took a deep dive into feature selection using scikit-learn's SelectFromModel. We started by understanding why feature selection is significant and how SelectFromModel helps achieve it. We then used the California Housing dataset to walk through each step, from training the model to extracting the most important features with SelectFromModel.

Dwelling on theory alone won't do much good unless you get your hands dirty with code. Hence, the upcoming exercises are carefully designed to solidify your understanding of this topic. Remember, practice is the key. Happy coding!

## Revealing Key Features in California Housing Prices

Are you curious about how we can pinpoint the most influential features in predicting housing prices in California? The given code loads the housing dataset, fits a Linear Regression model, and uses SelectFromModel with the default threshold to select the most important features. Click Run to see which features make the cut!

In [10]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectFromModel

# Load California housing dataset
housing = fetch_california_housing()
X = housing.data
Y = housing.target

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Fit a Linear Regression model on the training data
lr = LinearRegression()
lr.fit(X_train, Y_train)

# Apply SelectFromModel with the default (mean) threshold, keeping at most 4 features
sfm = SelectFromModel(lr, max_features=4)
sfm.fit(X_train, Y_train)

# Printing the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(housing.feature_names[feature_list_index])

MedInc
AveBedrms
Latitude
Longitude


## Adjusting Feature Selection Threshold

## Implanting SelectFromModel in the Voyage of Feature Selection