
This dataset contains real estate price data for California. The goal is to build a regression model that predicts house prices (`median_house_value`) and then to interpret that model.

In [None]:
import pandas as pd
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import plotly.express as px

In [None]:
cali = pd.read_csv('sample_data/housing.csv')
cali.head(5)

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [None]:
cali.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [None]:
cali.shape

(20640, 10)

In [None]:
print(cali.isnull().sum())

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64


In [None]:
# Drop rows with missing values: only total_bedrooms has gaps (207 of 20640 rows, about 1%)
cali = cali.dropna()

In [None]:
X = cali.drop(['median_house_value'], axis=1)
y = cali['median_house_value']

In [None]:
X_numeric = X.select_dtypes(include=[np.number]).columns
X_categorical = X.select_dtypes(exclude=[np.number]).columns

In [None]:
# Preprocessing pipelines: scale numeric features, one-hot encode categorical ones
num_pipe = Pipeline([('scaler', StandardScaler())])
cat_pipe = Pipeline([('encoder', OneHotEncoder())])

In [None]:
# Combine the two preprocessing pipelines with a ColumnTransformer
preprocessor = ColumnTransformer([('numeric', num_pipe, X_numeric),
                                  ('encoder', cat_pipe, X_categorical)])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape)
print(X_test.shape)

(14303, 9)
(6130, 9)


In [None]:
# Fit polynomial models of degree 1 to 4 and record the test MSE of each
linreg = LinearRegression()
mses = []
for degree in range(1, 5):
    pipeline = Pipeline([('preprocessor', preprocessor),
                         ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
                         ('linreg', linreg)])
    pipeline.fit(X_train, y_train)
    mse = mean_squared_error(y_test, pipeline.predict(X_test))
    mses.append(mse)
print(mses)

[4614164009.958697, 2.440621291477511e+26, 4.2187222671943187e+21, 5.187188673016255e+20]


In [None]:
# Degrees start at 1, so add 1 to the zero-based list index of the smallest MSE
best_model = mses.index(min(mses)) + 1
best_mse = min(mses)
print(f'The best degree polynomial model is: {best_model}')
print(f'The smallest mse is: {best_mse}')

The best degree polynomial model is: 1
The smallest mse is: 4614164009.958697
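
Because MSE is measured in squared dollars, it is hard to read directly. A minimal sketch, assuming only the MSE value printed above, converts it to a root mean squared error (RMSE), which is in the same units as `median_house_value`. The name `rmse` is introduced here for illustration.

```python
import math

# Smallest test MSE printed above (degree 1 model)
best_mse = 4614164009.958697

# RMSE is in the same units as the target (dollars)
rmse = math.sqrt(best_mse)
print(f"RMSE: {rmse:,.0f}")
```

The result is roughly 68,000 dollars, which gives a sense of the typical prediction error of the degree 1 model on the test set.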


**Permutation feature importance** is a model inspection technique. It is defined as the decrease in a model's score when the values of a single feature are randomly shuffled. Shuffling breaks the relationship between the feature and the target, so the drop in the model score indicates how much the model depends on that feature.
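
The definition above can be sketched by hand. The following is a minimal illustration, not scikit-learn's implementation: it assumes NumPy only, uses hypothetical toy data, and introduces the helper names `r2_score` and `manual_permutation_importance` for this example. For simplicity it scores the model on the same data it was fitted on, whereas this notebook scores on the held-out test set.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination (R^2)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def manual_permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column to break its link with the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - r2_score(y, predict(X_perm)))
        importances[col] = np.mean(drops)
    return importances

# Hypothetical toy data: y depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Ordinary least squares with an intercept term
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(X_new):
    return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

importances = manual_permutation_importance(predict, X, y)
print(importances)
```

Column 0 should receive by far the largest importance and column 2 a value close to zero. An importance is a drop in score, so it is not bounded by 1 and can be slightly negative for an irrelevant feature.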

In [None]:
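# Refit the best model (degree 1) and compute permutation importance on the test set.
# With no `scoring` argument, the estimator's default score is used (R^2 for regressors).
best_pipe = Pipeline([('preprocessor', preprocessor),
                      ('poly', PolynomialFeatures(degree=1, include_bias=False)),
                      ('linreg', linreg)])
best_pipe.fit(X_train, y_train)
r = permutation_importance(best_pipe, X_test, y_test, n_repeats=30, random_state=0)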
importance_df = pd.DataFrame({'Feature': X.columns,
                              'Importance': r.importances_mean})

# Sort the DataFrame by importance values
importance_df = importance_df.sort_values(by='Importance', ascending=False)
print(importance_df)


              Feature  Importance
7       median_income    0.838405
0           longitude    0.432648
1            latitude    0.421659
4      total_bedrooms    0.301819
5          population    0.236796
8     ocean_proximity    0.055409
6          households    0.034032
3         total_rooms    0.030791
2  housing_median_age    0.027352


In [None]:
fig = px.bar(importance_df, x="Feature", y="Importance", width=600, height=500)
fig.show()

1. `median_income` is the most important feature for predicting the house price.
2. The geographical coordinates (`longitude` and `latitude`) come next, with nearly equal importance.
3. `total_bedrooms` is the next most important feature, followed by `population`.
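
To make the ranking above easier to compare, the importances can be expressed relative to the top feature. This is a small sketch that assumes only the values copied from the printed table. The names `importances`, `ranked`, and `top` are introduced here for illustration.

```python
# Importance values copied from the printed table above
importances = {
    "median_income": 0.838405,
    "longitude": 0.432648,
    "latitude": 0.421659,
    "total_bedrooms": 0.301819,
    "population": 0.236796,
    "ocean_proximity": 0.055409,
    "households": 0.034032,
    "total_rooms": 0.030791,
    "housing_median_age": 0.027352,
}

# Rank features from most to least important
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
top = ranked[0][1]

# Show each importance as a share of the top feature's importance
for feature, value in ranked:
    print(f"{feature:<20} {value:.3f}  {value / top:.0%} of top")
```

On this scale, `longitude` and `latitude` each reach about half the importance of `median_income`, while the last four features each stay below a tenth of it.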
