#### Feature Selection
##### In this notebook, I experiment with different feature selection models to pare down the number of features in the final model. Before doing so, I convert all categorical data to dummy variables and standardize the data.

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from numpy.random import RandomState
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv("/Users/Triveni/Desktop/dataScience/data/train.csv")

In [3]:
catVars = [col for col in list(train) if train[col].dtype=="object"]
trainPlusDummies = pd.get_dummies(data=train,columns=catVars) # create dummy variables
trainPlusDummies.dropna(inplace=True)
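
To make the effect of the cell above concrete, here is a minimal sketch on a toy frame. The column names and values are hypothetical stand-ins, not taken from train.csv. It shows that `get_dummies` (with its default `dummy_na=False`) encodes a missing category as all-zero indicators, so the later `dropna` only removes rows with missing numeric values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; column names and values are hypothetical.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": pd.Series(["Grvl", "Pave", None], dtype="object"),
})

# Same rule as the notebook: treat object-dtype columns as categorical.
cat_cols = [col for col in toy.columns if toy[col].dtype == "object"]

# One indicator column per category level; a missing level becomes all zeros.
toy_dummies = pd.get_dummies(data=toy, columns=cat_cols)

# Only the row with a missing numeric value (index 1) is dropped.
cleaned = toy_dummies.dropna()

print(list(toy_dummies.columns))
print(cleaned)
```

If missing categories should count as their own level, `dummy_na=True` is one option to consider.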

In [4]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

In [5]:
X = trainPlusDummies.drop(columns=["SalePrice","Id"])
Y = trainPlusDummies.SalePrice
featureNames = list(X)
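# No random_state is passed, so the split (and every score below) changes between runs.
x_train,x_test,y_train,y_test = sklearn.model_selection.train_test_split(X,Y)
X_scaler = StandardScaler() # fit the scaler on the training set only, then reuse it on the test set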
x_train = X_scaler.fit_transform(x_train)
x_test = X_scaler.transform(x_test)
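
The split-then-scale pattern above can be sketched on synthetic data as follows. The data, the variable names, and the `random_state=0` seed are my assumptions for illustration; the notebook's own split is unseeded. The point is that the scaler learns its mean and variance from the training rows only, so the test rows end up close to, but not exactly, zero mean.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (hypothetical data).
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=200)

# Seeding the split is an assumption added here to make the sketch repeatable.
xa, xb, ya, yb = train_test_split(X_demo, y_demo, random_state=0)

scaler = StandardScaler()
xa_scaled = scaler.fit_transform(xa)  # statistics come from the training rows
xb_scaled = scaler.transform(xb)      # test rows reuse those statistics

print(xa_scaled.mean(axis=0), xa_scaled.std(axis=0))
print(xb_scaled.mean(axis=0))
```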

##### I first try recursive feature elimination with cross-validation (RFECV), using a vanilla linear regression as the estimator. The R² score on the test set is quite low.

In [6]:
linearModel = LinearRegression()
rfe = RFECV(estimator=linearModel,cv=5)
rfe.fit(x_train, y_train)
rfe.score(x_test,y_test)

0.46405107062861578
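
The score alone does not show which features were kept. A minimal sketch of inspecting a fitted RFECV, using synthetic data from `make_regression` and hypothetical feature names in place of the housing data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, of which 3 carry signal (hypothetical setup).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]

selector = RFECV(estimator=LinearRegression(), cv=5)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns; ranking_ is 1 for kept features.
kept = [name for name, keep in zip(demo_names, selector.support_) if keep]
print(selector.n_features_, kept)
print(selector.ranking_)
```

In the notebook, the same mask could be zipped with 'featureNames' to list the surviving columns.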

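##### Then I try the same feature elimination using a Lasso model. I set the random state so that my results can be reproduced, although with Lasso's default cyclic coordinate selection the random state has no effect, and the unseeded train/test split still varies between runs. The score is better than the previous model's, but not high enough to instill confidence in the feature selection.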

In [7]:
from sklearn.linear_model import Lasso
lasso = Lasso(random_state=np.random.RandomState(5))
rfe = RFECV(estimator=lasso,cv=5)
rfe.fit(x_train, y_train)
rfe.score(x_test,y_test)

0.61359545327804965
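
A small check on how `random_state` interacts with Lasso, sketched on synthetic data (the data and the `alpha` value are assumptions). With the default `selection='cyclic'`, the seed is not used, so two different seeds give identical coefficients. The sketch also counts the coefficients Lasso drives exactly to zero, which is what makes it usable for feature selection.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, of which 3 carry signal (hypothetical setup).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Two different seeds; selection defaults to 'cyclic', which ignores them.
lasso_a = Lasso(alpha=1.0, random_state=5).fit(X_demo, y_demo)
lasso_b = Lasso(alpha=1.0, random_state=99).fit(X_demo, y_demo)
same_coefs = np.array_equal(lasso_a.coef_, lasso_b.coef_)
print(same_coefs)

# Coefficients that are exactly zero correspond to features Lasso discards.
n_zero = int(np.sum(lasso_a.coef_ == 0))
print(n_zero, "of", lasso_a.coef_.size, "coefficients are zero")
```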

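##### Finally, I fit a Random Forest Regressor. I set max_features to 'sqrt' so that each split considers only a random subset of features, with size equal to the square root of the total number of features. The forest is still trained on every feature, so this step measures predictive performance rather than selecting a subset. This model provides the best score of all the models so far.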

In [8]:
rf = RandomForestRegressor(random_state=3, max_features='sqrt')
rf.fit(x_train,y_train)
rf.score(x_test,y_test)

0.81588047216909565
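
One way a fitted forest can inform feature selection is through its impurity-based `feature_importances_`. The following is a sketch on synthetic data with hypothetical feature names, not the notebook's actual ranking:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 features, of which 3 carry signal (hypothetical setup).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]

forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=3)
forest.fit(X_demo, y_demo)

# Importances sum to 1; sort descending to see the strongest features first.
order = np.argsort(forest.feature_importances_)[::-1]
for idx in order[:5]:
    print(demo_names[idx], round(float(forest.feature_importances_[idx]), 3))
```

Impurity-based importances can favor high-cardinality features, so they are best read as a rough ranking.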

##### For reference, 'rfAll' is a Random Forest Regressor that considers all features at every split (the default). Its score is slightly higher: 0.828 versus 0.816.

In [9]:
rfAll = RandomForestRegressor(random_state=3)
rfAll.fit(x_train,y_train)
rfAll.score(x_test,y_test)

0.82775069916268917
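
Because the two forests differ by about 0.01 on a single unseeded split, the gap may be within run-to-run noise. A sketch of comparing the two settings with 5-fold cross-validation instead, on synthetic data (the data, `n_estimators`, and the 'results' dictionary are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: 10 features, of which 3 carry signal (hypothetical setup).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

results = {}
# max_features=None means every feature is considered at each split.
for label, max_feats in [("sqrt", "sqrt"), ("all", None)]:
    forest = RandomForestRegressor(
        n_estimators=50, max_features=max_feats, random_state=3
    )
    scores = cross_val_score(forest, X_demo, y_demo, cv=5)  # R² per fold
    results[label] = (scores.mean(), scores.std())
    print(label, round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```

If the difference in mean score is smaller than the fold-to-fold spread, the two settings are hard to tell apart on this evidence.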