# Dimensionality Reduction
* Dimensionality reduction is the process of reducing the number of features in a dataset.
* Fitting a model directly on the entire set of features may lead to spurious predictions and poor generalization.
* Dimensionality reduction is applied to prevent these issues.

## Need for dimensionality reduction
Dimensionality reduction helps prevent overfitting.
* Overfitting occurs when the model memorizes the training data and fails to generalize to unseen data.
* An overfitted model cannot be applied reliably to real-world problems because it generalizes poorly.

## Types of Dimensionality Reduction
* **Feature Selection**: Feature selection methods reduce the number of features by discarding the least important ones.
* **Feature Extraction**: Feature extraction methods reduce the number of features by combining the original features and transforming them into a specified number of new features.
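
The difference between the two families can be seen on a small synthetic dataset. The sketch below is illustrative only: the data comes from `make_regression`, and PCA is used as one example of a feature extraction method (it is not used elsewhere in this notebook).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption, for illustration): 200 rows, 10 features, 3 informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Feature selection: keep 3 of the original columns, values left unchanged
selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
X_selected = selector.transform(X)
kept = np.flatnonzero(selector.get_support())

# Feature extraction: build 3 new columns, each a linear combination of all 10
pca = PCA(n_components=3).fit(X)
X_extracted = pca.transform(X)

print("selected column indices:", kept)
print("shapes:", X_selected.shape, X_extracted.shape)
```

Both outputs have three columns, but the selected columns are a subset of the originals, while each extracted column mixes all ten inputs.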

## Feature Selection
1. Filter methods
2. Wrapper methods
3. Embedded methods
4. Feature Importance


# Import the required libraries

In [None]:
import numpy as np 
import pandas as pd 
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_regression, chi2
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
import xgboost

pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',80)

In [None]:
'''Segregate the numeric and categoric columns'''
numeric_cols = ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold', 'SalePrice']

categoric_cols = ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition']


# Load the preprocessed data
The training data has been preprocessed already. The preprocessing steps involved are:
1. MICE Imputation
2. Log transformation
3. Square root transformation
4. Ordinal Encoding
5. Target Encoding
6. Z-Score Normalization

For a detailed implementation of the steps above, refer to my notebook on data preprocessing:

[Notebook Link](https://www.kaggle.com/srivignesh/data-preprocessing-for-house-price-prediction) 

In [None]:
preprocessed_train = pd.read_csv('../input/preprocessed-train-data/preprocessed_train_data.csv')
x_train, y_train = preprocessed_train[preprocessed_train.columns[:-1]], preprocessed_train[preprocessed_train.columns[-1]]
preprocessed_train.head()

In [None]:
'''Segregate numerical and categorical features'''
# The last entry of numeric_cols ('SalePrice') is the target
train_numeric = preprocessed_train[numeric_cols[:-1]]
train_categoric = preprocessed_train[categoric_cols]
x_train_numeric, x_train_categoric = train_numeric, train_categoric
y_train = preprocessed_train[numeric_cols[-1]]

# Feature Selection


# 1. Filter Methods

Filter methods select features independently of the model used.
They can use the following measures to select a useful set of features:
* Correlation for numeric columns
* Chi2 association for categoric columns

## Select K Best in sklearn

**F_Regression:**

F_Regression is used for numeric columns. It consists of two steps:

* The correlation between each feature and the target is computed.
* The correlation is then converted to an F score and then to a p-value.

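$Correlation = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{n \, \sigma _{x} \sigma _{y}} $

Here $\bar{x}$ and $\bar{y}$ are the means, $\sigma_{x}$ and $\sigma_{y}$ are the standard deviations, and $n$ is the number of samples.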
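
The two steps can be reproduced by hand and compared against `f_regression`. This is a minimal sketch on synthetic data (the arrays `X` and `y` below are made up for illustration), assuming the default `center=True` behaviour of `f_regression`, where the F score is $\frac{r^2}{1 - r^2}(n - 2)$ and the p-value comes from an F distribution with $(1, n - 2)$ degrees of freedom.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic data (assumption, for illustration only)
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Step 1: Pearson correlation of each feature with the target
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

# Step 2: convert the correlation to an F score, then to a p-value
dof = n - 2
f_manual = r ** 2 / (1 - r ** 2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)

# Compare with scikit-learn's implementation
f_sklearn, p_sklearn = f_regression(X, y)
print(np.allclose(f_manual, f_sklearn), np.allclose(p_manual, p_sklearn))
```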

**Chi2:**
* Chi2 is used for testing the association between categorical columns.

${\chi}^2 = \sum \limits_{i=1}^{c} \sum \limits_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2 }{e_{ij}}$

$o_{ij}$ = Observed frequency

$e_{ij}$ = Expected frequency
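
As a worked example of the statistic, the sketch below computes it for a small, hypothetical 2x2 table of observed counts and checks the result against `scipy.stats.chi2_contingency`. Note that `sklearn.feature_selection.chi2` applies the same statistic to per-class sums of non-negative feature values rather than to an explicit contingency table, so this illustrates the formula, not that function's internals.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows and columns are the categories of two columns
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Sum of (observed - expected)^2 / expected over every cell
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# correction=False disables Yates' continuity correction so the values match
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed, correction=False)
print(chi2_manual, chi2_scipy, p_value)
```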

In [None]:
'''Use f_regression for numeric columns'''
skb_numeric = SelectKBest(score_func = f_regression, k= 30)
skb_numeric.fit(x_train_numeric, y_train)
'''Get Support (Boolean array) for the columns from the instance'''
columns_selected_skb_numeric = x_train_numeric.columns[skb_numeric.get_support()]
x_train_skb_numeric = pd.DataFrame(skb_numeric.transform(x_train_numeric), columns = columns_selected_skb_numeric, index = x_train_numeric.index )
x_train_skb_numeric.head()

In [None]:
'''Chi2 test doesn't support negative values so square the dataset. Negative values are present in the dataset due to Z-Score normalization'''
x_train_categoric_sqr = x_train_categoric ** 2
'''Use chi2 for categoric columns'''
skb_categoric = SelectKBest(score_func = chi2, k= 30)
skb_categoric.fit(x_train_categoric_sqr, y_train)
'''Get Support (Boolean array) for the columns from the instance'''
columns_selected_skb_categoric = x_train_categoric_sqr.columns[skb_categoric.get_support()]
x_train_skb_categoric = pd.DataFrame(skb_categoric.transform(x_train_categoric), columns = columns_selected_skb_categoric, index =x_train_categoric.index )
x_train_skb_categoric.head()

In [None]:
'''Concatenate the selected features of numeric and categoric columns'''
x_train_skb = pd.concat([x_train_skb_numeric ,x_train_skb_categoric], axis =1)
x_train_skb.head()

# 2. Wrapper methods

Wrapper methods use an estimator to select a useful set of features. The techniques available are:
* Recursive Feature Elimination
* Recursive Feature Elimination with Cross-Validation


## Recursive Feature Elimination (RFE)

* The estimator provided to RFE assigns weights to features (e.g., coefficients). RFE recursively eliminates the features with the lowest weights.
* The estimator is first trained on the initial set of features. The weight of each feature is read from an attribute of the fitted estimator, such as coef_ or feature_importances_.
* The lowest-weighted features are removed from the current set of features. This procedure is repeated on the remaining features until the specified number of features to select is reached.
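
The elimination loop can be sketched by hand. The example below is a simplified illustration, assuming synthetic data, a `LinearRegression` estimator, and one feature removed per iteration; it then checks that `RFE` with `step=1` arrives at the same subset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

n_features_to_select = 3
remaining = list(range(X.shape[1]))  # indices of the features still in play

while len(remaining) > n_features_to_select:
    # Refit on the remaining features and read their weights
    model = LinearRegression().fit(X[:, remaining], y)
    # Drop the feature with the smallest absolute coefficient
    weakest = int(np.argmin(np.abs(model.coef_)))
    del remaining[weakest]

# Same procedure with scikit-learn's RFE
rfe_demo = RFE(estimator=LinearRegression(),
               n_features_to_select=n_features_to_select, step=1).fit(X, y)
rfe_selected = [int(i) for i in np.flatnonzero(rfe_demo.support_)]
print(remaining, rfe_selected)
```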

In [None]:
xgb_model = xgboost.XGBRegressor(objective="reg:squarederror", random_state=42)
rfe = RFE(estimator= xgb_model, n_features_to_select = 25)
rfe.fit(x_train,y_train)
'''Select the columns that are already selected by RFE'''
columns_selected_rfe = x_train.columns[rfe.support_]
x_train_rfe = pd.DataFrame(rfe.transform(x_train), columns = columns_selected_rfe, index = x_train.index)
x_train_rfe.head()

## Recursive Feature Elimination Cross Validation (RFECV)
* RFECV is very similar to RFE, but it uses cross-validation to score each candidate number of features and outputs the optimal number of features to select.

In [None]:
rfecv = RFECV(estimator=xgb_model)
rfecv.fit(x_train, y_train)
'''Select the columns that are already selected by RFECV'''
columns_selected_rfecv = x_train.columns[rfecv.support_]
x_train_rfecv = pd.DataFrame(rfecv.transform(x_train), columns = columns_selected_rfecv, index = x_train.index)
x_train_rfecv.head()

# 3. Embedded Methods
Embedded methods select features during the training process itself.
* When the importance of a feature is low, its coefficient becomes zero, so that feature is not used to make predictions.
 
## LASSO regression
LASSO stands for **Least Absolute Shrinkage and Selection Operator**

![](https://www.statisticshowto.com/wp-content/uploads/2015/09/lasso-regression.png)

$\lambda$ = Penalty (Tuning Parameter)

When $\lambda$ = 0 no parameters are eliminated and the model is equal to ordinary linear regression. As $\lambda$ increases, more coefficients are shrunk to zero.

* The parameter estimates are found by minimizing this cost function.
* In the special case of orthonormal features, a coefficient becomes zero when the absolute value of its least-squares estimate is less than $\lambda / 2$.
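
The $\lambda / 2$ threshold can be checked numerically. The sketch below assumes the cost function is the residual sum of squares plus $\lambda \sum |\beta_j|$ and that the feature matrix has orthonormal columns (built here with a QR decomposition of random data). scikit-learn's `Lasso` minimizes $\frac{1}{2n} \lVert y - Xw \rVert^2 + \alpha \lVert w \rVert_1$, so the matching setting is $\alpha = \lambda / (2n)$.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic orthonormal design (assumption, for illustration only)
rng = np.random.default_rng(0)
n, p = 50, 4
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))  # Q.T @ Q is the identity
beta_true = np.array([2.0, 0.4, -0.3, -1.5])
y = Q @ beta_true + 0.01 * rng.normal(size=n)

# Least-squares estimates for an orthonormal design
beta_ols = Q.T @ y

# Soft thresholding: shrink by lambda / 2, zero out anything smaller
lam = 1.0
beta_soft = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)

# Same problem solved by scikit-learn, with alpha rescaled to match lambda
lasso_demo = Lasso(alpha=lam / (2 * n), fit_intercept=False,
                   tol=1e-10, max_iter=100_000).fit(Q, y)

print("least squares :", np.round(beta_ols, 3))
print("soft threshold:", np.round(beta_soft, 3))
print("sklearn Lasso :", np.round(lasso_demo.coef_, 3))
```

The two coefficients whose least-squares estimates are below 0.5 in absolute value are set to zero, and the other two are shrunk towards zero by 0.5.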


In [None]:
lasso = Lasso(alpha = 0.3)
'''fit a LASSO model'''
lasso.fit(x_train, y_train)
'''To only select based on max_features, set threshold=-np.inf. Set prefit = True if the model is already fitted to the dataset.'''
sfm_lasso = SelectFromModel(estimator=lasso, prefit= True, max_features=65, threshold=-np.inf)
lasso_selected_columns = x_train.columns[sfm_lasso.get_support()]
x_train_lasso = pd.DataFrame(sfm_lasso.transform(x_train), columns = lasso_selected_columns, index = x_train.index)
x_train_lasso.head()

# 4. Feature Importance
The feature importances are calculated after fitting the model to the entire set of features, which assigns a weight to each feature.
* The model might have attributes such as coef_ or feature_importances_ that help to select a subset of features. Using these, the least important features are pruned.
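
The pruning step amounts to ranking the importances and keeping the top entries. The sketch below shows this on synthetic data with a decision tree (both chosen for illustration) and checks that `SelectFromModel` with `max_features` and `threshold=-np.inf` keeps the same columns.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption, for illustration only)
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

# Fit on all features and read the importance assigned to each one
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
importances = tree.feature_importances_

# Keep the k most important features, prune the rest
k = 3
top_k = np.sort(np.argsort(importances)[::-1][:k])

# Same selection through SelectFromModel on the already fitted tree
sfm_demo = SelectFromModel(estimator=tree, prefit=True,
                           max_features=k, threshold=-np.inf)
sfm_selected = np.flatnonzero(sfm_demo.get_support())
print(top_k, sfm_selected)
```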

## Select From Model in sklearn

In [None]:
'''Any regressor can be used. Here Decision Tree is used'''
# A fixed random_state keeps the selected features reproducible between runs
dec_tree_model = DecisionTreeRegressor(random_state=42)
dec_tree_model.fit(x_train, y_train)

'''To only select based on max_features, set threshold=-np.inf. Set prefit = True if the model is already fitted to the dataset.'''
sfm = SelectFromModel(estimator=dec_tree_model, prefit= True, max_features=65, threshold=-np.inf)

'''Selected columns'''
columns_selected_sfm = preprocessed_train.columns[:-1][sfm.get_support()]
x_train_sfm = pd.DataFrame(sfm.transform(x_train), columns = columns_selected_sfm, index = x_train.index)
x_train_sfm.head()