# Predicting sale price of houses

The aim of this project is to build a machine learning model that predicts the sale price of houses from multiple explanatory variables describing aspects of these houses.
The dataset used for this project is available on [Kaggle.com](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).


# House price prediction: Feature selection

This notebook is the third step of our project, which consists of the following steps:
- 1\.  Data analysis
- 2\.  Feature engineering
- **3\.  Feature selection**
- 4\.  Model building

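In the following, we select the subset of variables that are most predictive of the sale price and use them to build our model. The selection relies on a Lasso regression: its L1 penalty shrinks the coefficients of uninformative features to zero, and we keep the features whose coefficients remain non-zero.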

In [6]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

import math

In [3]:
X_train = pd.read_csv('xtrain.csv')
X_test = pd.read_csv('xtest.csv')

X_train.head()

|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | LotFrontage_na | MasVnrArea_na | GarageYrBlt_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 931 | 0.0 | 0.75 | 0.461171 | 0.377048 | 1.0 | 1.0 | 0.333333 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 0.545455 | 0.75 | 0.666667 | 0.75 | 12.21106 | 0.0 | 0.0 | 0.0 |
| 1 | 657 | 0.0 | 0.75 | 0.456066 | 0.399443 | 1.0 | 1.0 | 0.333333 | 0.333333 | 1.0 | ... | 1.0 | 0.0 | 0.636364 | 0.5 | 0.666667 | 0.75 | 11.887931 | 0.0 | 0.0 | 0.0 |
| 2 | 46 | 0.588235 | 0.75 | 0.394699 | 0.347082 | 1.0 | 1.0 | 0.0 | 0.333333 | 1.0 | ... | 1.0 | 0.0 | 0.090909 | 1.0 | 0.666667 | 0.75 | 12.675764 | 0.0 | 0.0 | 0.0 |
| 3 | 1349 | 0.0 | 0.75 | 0.388581 | 0.493677 | 1.0 | 1.0 | 0.666667 | 0.666667 | 1.0 | ... | 1.0 | 0.0 | 0.636364 | 0.25 | 0.666667 | 0.75 | 12.278393 | 1.0 | 0.0 | 0.0 |
| 4 | 56 | 0.0 | 0.75 | 0.577658 | 0.402702 | 1.0 | 1.0 | 0.333333 | 0.333333 | 1.0 | ... | 1.0 | 0.0 | 0.545455 | 0.5 | 0.666667 | 0.75 | 12.103486 | 0.0 | 0.0 | 0.0 |


In [5]:
y_train = X_train['SalePrice']
y_test = X_test['SalePrice']

X_train.drop(['Id', 'SalePrice'], axis=1, inplace=True)
X_test.drop(['Id', 'SalePrice'], axis=1, inplace=True)

## Feature selection

In [7]:
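# Fit a Lasso (L1 penalty, alpha=0.005) and let SelectFromModel keep the
# features whose coefficients are not shrunk to zero.
# random_state is fixed to 0 for reproducibility.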

sel_ = SelectFromModel(Lasso(alpha=0.005, random_state=0))

sel_.fit(X_train, y_train)

SelectFromModel(estimator=Lasso(alpha=0.005, copy_X=True, fit_intercept=True,
                                max_iter=1000, normalize=False, positive=False,
                                precompute=False, random_state=0,
                                selection='cyclic', tol=0.0001,
                                warm_start=False),
                max_features=None, norm_order=1, prefit=False, threshold=None)
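
To make the selection step above easier to follow, here is a minimal sketch of the same `SelectFromModel(Lasso(...))` pattern on synthetic data. The data, the variable names (`X_demo`, `y_demo`, `selector`) and the choice of two informative features are assumptions made for illustration only; they are not part of the house price dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (assumption, for illustration): 5 features scaled to [0, 1],
# of which only the first two actually drive the target.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.01, size=200)

# Same pattern as in the notebook: the L1 penalty pushes the coefficients of
# uninformative features to zero, and SelectFromModel keeps the rest.
selector = SelectFromModel(Lasso(alpha=0.005, random_state=0))
selector.fit(X_demo, y_demo)

coefs = selector.estimator_.coef_   # fitted Lasso coefficients
mask = selector.get_support()       # boolean mask, one entry per feature

for i, (coef, keep) in enumerate(zip(coefs, mask)):
    print(f"feature {i}: coef={coef:+.4f} selected={keep}")
```

In this toy setup only the first two features should be kept, since the other three carry no signal. A larger `alpha` penalizes coefficients more strongly and generally keeps fewer features; a smaller one keeps more.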

In [8]:
sel_.get_support()

array([ True,  True, False, False, False, False, False, False, False,
       False, False,  True, False, False, False, False,  True,  True,
       False,  True,  True, False, False, False,  True, False, False,
       False, False,  True, False,  True, False, False, False, False,
       False, False, False,  True,  True, False,  True, False, False,
        True,  True, False, False, False, False, False,  True, False,
       False,  True,  True,  True, False,  True,  True, False, False,
       False,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

In [10]:
selected_feats = X_train.columns[(sel_.get_support())]

print("Total features : ", len(X_train.columns))
print("Selected features : ", len(selected_feats))
print("List of selected features : ", list(selected_feats))

Total features :  82
Selected features :  22
List of selected features :  ['MSSubClass', 'MSZoning', 'Neighborhood', 'OverallQual', 'OverallCond', 'YearRemodAdd', 'RoofStyle', 'MasVnrType', 'BsmtQual', 'BsmtExposure', 'HeatingQC', 'CentralAir', '1stFlrSF', 'GrLivArea', 'BsmtFullBath', 'KitchenQual', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageCars', 'PavedDrive']
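
Once the selected feature names are known, both the training and the test set should be restricted to exactly the same columns. The following is a small sketch of that step on toy frames; `X_train_toy`, `X_test_toy` and their values are made up for illustration and only stand in for `X_train` and `X_test`.

```python
import pandas as pd

# Toy stand-ins for X_train / X_test (values are made up for illustration).
X_train_toy = pd.DataFrame({
    "OverallQual": [0.6, 0.8],
    "LotArea": [0.37, 0.40],
    "GrLivArea": [0.55, 0.61],
})
X_test_toy = pd.DataFrame({
    "OverallQual": [0.7],
    "LotArea": [0.35],
    "GrLivArea": [0.58],
})

# Hypothetical selection result, in the same form as list(selected_feats).
selected = ["OverallQual", "GrLivArea"]

# Fail early if the test set lacks any selected column.
missing = [col for col in selected if col not in X_test_toy.columns]
if missing:
    raise KeyError(f"Columns missing from the test set: {missing}")

# Index both frames with the same list so column order matches too.
X_train_sel = X_train_toy[selected]
X_test_sel = X_test_toy[selected]

print(X_train_sel.shape, X_test_sel.shape)
```

Indexing with a list of names, rather than with positions, keeps the selection valid even if the column order of the two files differs.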


In [18]:
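# Alternative: read the selection directly from the fitted Lasso,
# keeping the features whose coefficients are non-zero.
selected_feats = X_train.columns[(sel_.estimator_.coef_ != 0).ravel().tolist()]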

In [20]:
# Save the names of the selected features for later use.
pd.Series(selected_feats).to_csv("selected_features.csv", index=False)
