# ML Model Pipeline: Feature Engineering

**Predicting House Sale Price**

The aim of the project is to develop a machine learning model to predict house sale prices using features that describe various aspects of a house.

**Goals**

- To understand the features that affect house prices in the market.
- To determine the significant features on which the house price depends.
- To develop a model that predicts the price of a house using the selected/important features.

**Data Source:**
- Ames Housing dataset 
    - This dataset originates from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
    
**Credits:**
- www.kaggle.com - This project wouldn't have been possible without the Ames dataset from www.kaggle.com
- https://feature-engine.readthedocs.io/en/latest/_modules/feature_engine/categorical_encoders.html#OrdinalCategoricalEncoder

We'll use Feature Engineering to handle the following:
 - Missing values
 - Temporal variables
 - Skewed distribution
 - Categorical variables
 - Variable standardization

## Import Libraries

In [1]:
#libs for data processing
import pandas as pd
import numpy as np

#libs for plotting
import matplotlib.pyplot as plt
import seaborn as sns

#size plot/graph dimensions for matplotlib
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6

#Display all the columns
pd.set_option('display.max_columns', None)

#Display matplotlib output inline.
%matplotlib inline

#Split Data
from sklearn.model_selection import train_test_split

#Feature scaling
from sklearn.preprocessing import MinMaxScaler

#suppress warnings
import warnings
warnings.simplefilter(action='ignore')

# label encoding the data 
from sklearn.preprocessing import LabelEncoder 

## Read Data

In [2]:
try:
    df = pd.read_csv('houseprice.csv')
    print('[SUCCESS] Done loading the dataset...')

except FileNotFoundError:
    print('Unable to load the dataset!')

[SUCCESS] Done loading the dataset...


In [3]:
#### [1] Visualize the table/Dataframe
df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


## Split Dataset into Train and Test sets

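We’ll split the dataset into train and test sets before feature engineering to prevent data leakage. Data leakage occurs when information from the test set (for example, a mode or a min/max computed over all rows) influences how the train set is transformed, which makes the test score look better than it should.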
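
To make the leakage point concrete, here is a minimal sketch on a made-up toy table (the values and the names `toy`, `toy_train`, `toy_test`, and `fill_value` are assumptions for illustration, not part of the housing data). The fill value is learned from the train rows only and then reused on the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the housing table (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [60.0, None, 80.0, 60.0, None, 90.0, 60.0, 75.0, 85.0, 60.0],
    "SalePrice": [100, 120, 150, 130, 110, 180, 125, 140, 160, 200],
})

# Split first, so the test rows play no part in any statistic we learn.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

# Learn the fill value (here the mode) from the train rows only ...
fill_value = toy_train["LotFrontage"].mode().iloc[0]

# ... and reuse that same value on both sets.
toy_train["LotFrontage"] = toy_train["LotFrontage"].fillna(fill_value)
toy_test["LotFrontage"] = toy_test["LotFrontage"].fillna(fill_value)

print(fill_value)
```

Computing `fill_value` on `toy` instead of `toy_train` would let the test rows shape the preprocessing, which is exactly the leakage we want to avoid.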

In [4]:
#split
#X = df.drop('SalePrice', axis=1)
X = df
y = df['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

print(f'Shape of the Train set : {X_train.shape}.')
print(f'Shape of the Test set : {X_test.shape}.')

Shape of the Train set : (1314, 81).
Shape of the Test set : (146, 81).


## Missing values

#### Categorical Variables
 - Let's replace missing data for categorical variables with the word 'missingVal'.

In [5]:
#Create a list containing categorical variables with the missing values
catVars_with_missingVal = [var for var in df.columns if X_train[var].isnull().sum() > 0 and X_train[var].dtype == 'object']

#Note: the counts below are computed on the full dataset (df), not only on X_train
missing_vars = []#columns with missing values (a list is used because DataFrame.append was removed in pandas 2.0)
for var in catVars_with_missingVal:
    k = df[var].isnull().sum()
    perc = round((k/len(df))*100, 2)

    #print columns with missing values with their corresponding 'sum of the missing values' & 'percent of the missing values'
    if perc > 0:
        print('|',var, ':',k, '|',perc,'% of data is missing |')
        missing_vars.append(var)

print('\n')
print(f'There are {len(missing_vars)} categorical variables with missing values.')

| Alley : 1369 | 93.77 % of data is missing |
| MasVnrType : 8 | 0.55 % of data is missing |
| BsmtQual : 37 | 2.53 % of data is missing |
| BsmtCond : 37 | 2.53 % of data is missing |
| BsmtExposure : 38 | 2.6 % of data is missing |
| BsmtFinType1 : 37 | 2.53 % of data is missing |
| BsmtFinType2 : 38 | 2.6 % of data is missing |
| Electrical : 1 | 0.07 % of data is missing |
| FireplaceQu : 690 | 47.26 % of data is missing |
| GarageType : 81 | 5.55 % of data is missing |
| GarageFinish : 81 | 5.55 % of data is missing |
| GarageQual : 81 | 5.55 % of data is missing |
| GarageCond : 81 | 5.55 % of data is missing |
| PoolQC : 1453 | 99.52 % of data is missing |
| Fence : 1179 | 80.75 % of data is missing |
| MiscFeature : 1406 | 96.3 % of data is missing |


There are 16 categorical variables with missing values.


##### Replace missing data for categorical variables with 'missingVal'

In [6]:
X_train[catVars_with_missingVal] = X_train[catVars_with_missingVal].fillna('missingVal')
X_test[catVars_with_missingVal] = X_test[catVars_with_missingVal].fillna('missingVal')

In [7]:
#verify that there are no missing values left in the X_train dataset
#option 1
X_train[catVars_with_missingVal].isnull().sum()

#option2
#[var for var in catVars_with_missingVal if X_train[var].isnull().sum() > 0]

Alley           0
MasVnrType      0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
Electrical      0
FireplaceQu     0
GarageType      0
GarageFinish    0
GarageQual      0
GarageCond      0
PoolQC          0
Fence           0
MiscFeature     0
dtype: int64

In [8]:
#verify that there are no missing values left in the X_test dataset
#option 1
X_test[catVars_with_missingVal].isnull().sum()

#option2
#[var for var in catVars_with_missingVal if X_test[var].isnull().sum() > 0]

Alley           0
MasVnrType      0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
Electrical      0
FireplaceQu     0
GarageType      0
GarageFinish    0
GarageQual      0
GarageCond      0
PoolQC          0
Fence           0
MiscFeature     0
dtype: int64
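
As an alternative to calling `fillna` on each set, the same constant-fill step can be sketched with scikit-learn's `SimpleImputer`, which keeps the step as a reusable fitted object. The toy frames and the names `toy_train`, `toy_test`, and `cat_cols` below are assumptions for illustration; the notebook itself uses `fillna`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical data with gaps (made-up rows, column names borrowed from the dataset).
toy_train = pd.DataFrame(
    {"Alley": ["Grvl", np.nan, "Pave"], "Fence": [np.nan, "MnPrv", np.nan]},
    dtype=object,
)
toy_test = pd.DataFrame(
    {"Alley": [np.nan, "Pave"], "Fence": ["GdWo", np.nan]},
    dtype=object,
)

cat_cols = ["Alley", "Fence"]

# strategy="constant" replaces every missing entry with fill_value.
imputer = SimpleImputer(strategy="constant", fill_value="missingVal")
toy_train[cat_cols] = imputer.fit_transform(toy_train[cat_cols])
toy_test[cat_cols] = imputer.transform(toy_test[cat_cols])

print(toy_test)
```

Treating "missing" as its own category keeps the information that, for example, a house has no alley access or no fence, rather than discarding those rows.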

#### Numeric Variables

##### Replace missing data for numeric variables with the mode
 - Let's compute the mode of each variable on the train set only, and use it to fill the missing values in both the train and test sets.

In [9]:
#Create a list containing numeric variables with missing values
numVars_with_missingVal = [var for var in df.columns if X_train[var].isnull().sum() > 0 and X_train[var].dtype != 'object']

#Note: the counts below are computed on the full dataset (df), not only on X_train
missing_vars = []#columns with missing values (a list is used because DataFrame.append was removed in pandas 2.0)
for var in numVars_with_missingVal:
    k = df[var].isnull().sum()
    perc = round((k/len(df))*100, 2)

    #print columns with missing values with their corresponding 'sum of the missing values' & 'percent of the missing values'
    if perc > 0:
        print('|',var, ':',k, '|',perc,'% of data is missing |')
        missing_vars.append(var)

print('\n')
print(f'There are {len(missing_vars)} numeric variables with missing values.')

| LotFrontage : 259 | 17.74 % of data is missing |
| MasVnrArea : 8 | 0.55 % of data is missing |
| GarageYrBlt : 81 | 5.55 % of data is missing |


There are 3 numeric variables with missing values.


In [10]:
for var in numVars_with_missingVal:

    # compute the mode using the X_train set only
    # mode() returns every tied value in sorted order; [0] picks the first (smallest) one
    mode_val = X_train[var].mode()[0]

    # replace missing values by the mode
    X_train[var] = X_train[var].fillna(mode_val)
    X_test[var] = X_test[var].fillna(mode_val)

In [11]:
#verify that there are no missing values left in the X_train dataset
#option 1
X_train[numVars_with_missingVal].isnull().sum()

#option2
#[var for var in numVars_with_missingVal if X_train[var].isnull().sum() > 0]

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64

In [12]:
#verify that there are no missing values left in the X_test dataset
#option 1
X_test[numVars_with_missingVal].isnull().sum()

#option2
#[var for var in numVars_with_missingVal if X_test[var].isnull().sum() > 0]

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64
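
The per-column loop above can also be written in vectorized form. The sketch below uses made-up toy frames (`toy_train`, `toy_test`, `num_cols`, and `train_modes` are illustrative names): `DataFrame.mode()` returns one row per tied mode, so `.iloc[0]` takes the first mode of every column at once, and `fillna` accepts that Series to fill each column with its own value.

```python
import numpy as np
import pandas as pd

# Toy numeric data with gaps (made-up values).
toy_train = pd.DataFrame({
    "LotFrontage": [60.0, 60.0, 80.0, np.nan],
    "MasVnrArea": [0.0, np.nan, 0.0, 120.0],
})
toy_test = pd.DataFrame({
    "LotFrontage": [np.nan, 70.0],
    "MasVnrArea": [50.0, np.nan],
})

num_cols = ["LotFrontage", "MasVnrArea"]

# One mode per column, learned from the train set only.
train_modes = toy_train[num_cols].mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name.
toy_train[num_cols] = toy_train[num_cols].fillna(train_modes)
toy_test[num_cols] = toy_test[num_cols].fillna(train_modes)

print(train_modes)
```

For a continuous variable such as LotFrontage, the median is a common alternative to the mode; swapping `.mode().iloc[0]` for `.median()` in the sketch is enough to try it.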

#### Temporal variables

Recall: four temporal variables were identified in the data analysis notebook: YrSold, YearBuilt, YearRemodAdd, and GarageYrBlt. We'll replace YearBuilt, YearRemodAdd, and GarageYrBlt with the number of years elapsed between each of them and YrSold.
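
The elapsed-time idea can be sketched on three rows taken from the `df.head(5)` output above. The helper `elapsed_years` is a hypothetical, non-mutating variant written for this illustration: it returns a copy instead of modifying its input.

```python
import pandas as pd

# Three rows from the df.head() output (index 0, 1 and 3), temporal columns only.
toy = pd.DataFrame({
    "YrSold": [2008, 2007, 2006],
    "YearBuilt": [2003, 1976, 1915],
    "YearRemodAdd": [2003, 1976, 1970],
    "GarageYrBlt": [2003.0, 1976.0, 1998.0],
})

def elapsed_years(frame, year_cols, ref_col="YrSold"):
    """Return a copy where each column in year_cols holds ref_col minus its original value."""
    out = frame.copy()
    for col in year_cols:
        out[col] = out[ref_col] - out[col]
    return out

toy_elapsed = elapsed_years(toy, ["YearBuilt", "YearRemodAdd", "GarageYrBlt"])
print(toy_elapsed)
```

After the transformation, YearBuilt reads as "age of the house at the time of sale" (5, 31 and 91 years here), which is usually easier for a model to use than a raw calendar year.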

In [13]:
def calc_yrs_diffs(df, var):
    df[var] = df['YrSold'] - df[var]
    return df

for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:
    X_train = calc_yrs_diffs(X_train, var)
    X_test = calc_yrs_diffs(X_test, var) 

In [14]:
X_train.head()
#X_train.shape

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
907,908,50,RL,86.0,11500,Pave,missingVal,IR1,Lvl,AllPub,Inside,Gtl,Crawfor,Norm,Norm,1Fam,1.5Fin,7,7,70,19,Gable,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,Gd,TA,No,Rec,223,Unf,0,794,1017,GasA,Gd,Y,SBrkr,1020,1037,0,2057,0,0,1,1,3,1,Gd,6,Typ,1,Gd,Attchd,70.0,Fin,1,180,Fa,TA,Y,0,0,0,0,322,0,missingVal,missingVal,missingVal,0,6,2006,WD,Normal,250000
782,783,20,RL,67.0,16285,Pave,missingVal,IR2,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,1Story,7,5,8,7,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,TA,No,Unf,0,Unf,0,1413,1413,GasA,Ex,Y,SBrkr,1430,0,0,1430,0,0,2,0,3,1,Gd,6,Typ,0,missingVal,Attchd,8.0,RFn,2,605,TA,TA,Y,0,33,0,0,0,0,missingVal,missingVal,missingVal,0,6,2009,WD,Normal,187100
952,953,85,RL,60.0,7200,Pave,missingVal,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,SFoyer,5,8,37,6,Gable,CompShg,WdShing,HdBoard,,0.0,TA,Gd,CBlock,Gd,TA,Av,GLQ,660,Unf,0,108,768,GasA,Gd,Y,SBrkr,768,0,0,768,0,1,1,0,2,1,TA,5,Typ,0,missingVal,Detchd,35.0,Fin,1,396,TA,TA,Y,192,0,0,0,0,0,missingVal,MnPrv,missingVal,0,4,2009,WD,Normal,133900
620,621,30,RL,45.0,8248,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,1Fam,1Story,3,3,94,58,Gable,CompShg,Stucco,Stucco,,0.0,TA,TA,BrkTil,TA,TA,No,BLQ,41,Unf,0,823,864,GasA,TA,N,FuseF,864,0,0,864,1,0,1,0,2,1,TA,5,Typ,0,missingVal,missingVal,3.0,missingVal,0,0,missingVal,missingVal,N,0,0,100,0,0,0,missingVal,missingVal,missingVal,0,9,2008,WD,Normal,67000
669,670,30,RL,80.0,11600,Pave,missingVal,Reg,Lvl,AllPub,Inside,Gtl,Crawfor,Norm,Norm,1Fam,1Story,4,5,84,56,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,Fa,TA,No,Unf,0,Unf,0,700,700,GasA,Ex,Y,SBrkr,1180,0,0,1180,0,0,1,0,2,1,Fa,5,Typ,1,Gd,Detchd,84.0,Unf,1,252,TA,Fa,Y,0,0,67,0,0,0,missingVal,missingVal,missingVal,0,7,2006,WD,Normal,137500


#### Numeric variable Transformation

In [15]:
#log transform the strictly positive numerical variables to get more Gaussian-like distributions
for var in ['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']:
    X_train[var] = np.log(X_train[var])
    X_test[var] = np.log(X_test[var])

In [16]:
# verify that X_train set does not contain missing values after feature engineering
[var for var in ['LotFrontage', 'LotArea', '1stFlrSF',
                 'GrLivArea', 'SalePrice'] if X_train[var].isnull().sum() > 0]

[]

In [17]:
#verify that X_test set does not contain missing values after feature engineering
[var for var in ['LotFrontage', 'LotArea', '1stFlrSF',
                 'GrLivArea', 'SalePrice'] if X_test[var].isnull().sum() > 0]

[]
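
Two practical points about the log transform are worth a small sketch. First, because SalePrice is now on a log scale, anything predicted on that scale has to be mapped back with `np.exp` to obtain a price. Second, `np.log` is only suitable for strictly positive values; for columns that contain zeros, `np.log1p` is one possible alternative (the notebook does not use it, so treat it as an assumption here). The prices and areas are taken from the `df.head(5)` output.

```python
import numpy as np
import pandas as pd

# SalePrice values from the first five rows of the dataset.
prices = pd.Series([208500, 181500, 223500, 140000, 250000], name="SalePrice")

# Forward transform, as applied in the notebook.
log_prices = np.log(prices)

# Inverse transform: np.exp undoes np.log, giving prices back.
recovered = np.exp(log_prices)

# A column with zeros (2ndFlrSF is 0 for single-storey houses):
# log1p(x) = log(1 + x) stays finite at x = 0.
areas = pd.Series([854, 0, 866], name="2ndFlrSF")
log1p_areas = np.log1p(areas)

print(log_prices.round(6).tolist())
```

The last value, log(250000), is about 12.429216, which is the SalePrice shown for a 250000 house in the transformed X_train output later in the notebook.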

In [18]:
#test
#X_train.head() 

#### Encoding Categorical Variables

##### Using a custom target-mean ordinal encoding (instead of LabelEncoder())

###### Customized code for encoding categorical variables.

In [19]:
#Create a list of the categorical variables
categorical_vars = [var for var in df.columns if df[var].dtype == 'object']

In [20]:
# This function will assign ordinal numbers(eg. 0,1,2,3,4, etc) to the strings of the categorical variables,
# so that the smaller value corresponds to the category that shows the smaller
# mean house sale price. These numbers are ordered using the calculated mean for the target variable.


def custom_encode_categories(train, test, var, target):

    #Order the categories in each variable starting with the category that has the lowest
    #target value(SalePrice) to that with the highest
    ordered_lbls = train.groupby([var])[target].mean().sort_values().index

    #Create a dictionary that will hold key:value pairs for each variable i.e., {category: ordinal number}
    ordinal_lbls = {k: i for i, k in enumerate(ordered_lbls, 0)}

    #Use the above dictionary to replace the categorical strings by integers
    train[var] = train[var].map(ordinal_lbls)
    test[var] = test[var].map(ordinal_lbls)

for var in categorical_vars:
    custom_encode_categories(X_train, X_test, var, 'SalePrice')
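
To see what the encoding does, here is a minimal sketch on made-up data (the category labels "A" to "D", the prices, and all variable names are assumptions for illustration). It also shows one consequence of building the mapping from the train set only: a category that appears only in the test set has no entry in the dictionary, so `map` returns NaN for it. Filling such values with -1 is just one possible policy, not something the notebook does.

```python
import pandas as pd

# Toy train/test sets (made-up values).
toy_train = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C"],
    "SalePrice": [300, 320, 100, 120, 200],
})
toy_test = pd.DataFrame({"Neighborhood": ["B", "C", "D"]})  # "D" never appears in train

# Order categories by mean target, lowest first: B (110), C (200), A (310).
ordered = toy_train.groupby("Neighborhood")["SalePrice"].mean().sort_values().index
mapping = {category: rank for rank, category in enumerate(ordered)}

train_encoded = toy_train["Neighborhood"].map(mapping)
test_encoded = toy_test["Neighborhood"].map(mapping)

# Categories unseen in train become NaN after map().
n_unseen = int(test_encoded.isnull().sum())

# One possible policy (an assumption): flag unseen categories with -1.
test_encoded_filled = test_encoded.fillna(-1).astype(int)

print(mapping, n_unseen)
```

A quick check such as `X_test[categorical_vars].isnull().sum()` after encoding would reveal whether the real test set contains any category that the train set does not.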

#### Feature scaling

In [21]:
#List all variables (for X_train & X_test) except 'Id' and 'SalePrice'
train_vars = [var for var in X_train.columns if var not in ['Id', 'SalePrice']]
test_vars = [var for var in X_test.columns if var not in ['Id', 'SalePrice']]

# count number of variables
print(len(train_vars))
print(len(test_vars))

79
79


In [22]:
# create scaler
scaler = MinMaxScaler()
# fit the scaler to the train set only
scaler.fit(X_train[train_vars]) 

# Transform the train and test set
X_train[train_vars] = scaler.transform(X_train[train_vars])
X_test[train_vars] = scaler.transform(X_test[train_vars])
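
As a sketch of what `MinMaxScaler` computes, each column is rescaled as (x - min) / (max - min), with min and max taken from the data passed to `fit`. The toy arrays below are made up; they show that train values land in [0, 1], while a test value above the train maximum ends up above 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy feature (made-up values).
train_vals = np.array([[10.0], [20.0], [30.0]])
test_vals = np.array([[15.0], [40.0]])  # 40 is larger than anything seen in train

# Fit on the train set only, then transform both sets.
toy_scaler = MinMaxScaler().fit(train_vals)
train_scaled = toy_scaler.transform(train_vals)
test_scaled = toy_scaler.transform(test_vals)

# The same result computed by hand from the train min and max.
train_min = train_vals.min(axis=0)
train_max = train_vals.max(axis=0)
manual = (test_vals - train_min) / (train_max - train_min)

print(train_scaled.ravel(), test_scaled.ravel())
```

Values outside [0, 1] in the scaled test set are therefore expected and are not a sign of an error: they only mean that the test set contains values beyond the range seen during fitting.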

In [23]:
X_train.head(3)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
907,908,0.176471,0.75,0.521833,0.42666,1.0,1.0,0.333333,0.333333,1.0,0.25,0.0,0.708333,0.375,0.571429,0.75,0.142857,0.666667,0.75,0.514706,0.327869,0.2,0.285714,0.714286,0.733333,0.25,0.0,0.666667,0.75,0.4,0.75,0.75,0.25,0.166667,0.039511,0.666667,0.0,0.339897,0.166448,1.0,0.75,1.0,1.0,0.422489,0.502179,0.0,0.64307,0.0,0.0,0.333333,0.5,0.375,0.333333,0.666667,0.333333,1.0,0.333333,0.8,0.833333,0.654206,1.0,0.25,0.126939,0.4,1.0,1.0,0.0,0.0,0.0,0.0,0.670833,0.0,0.0,0.75,0.5,0.0,0.454545,0.0,0.625,0.8,12.429216
782,783,0.0,0.75,0.429425,0.49475,1.0,1.0,1.0,0.333333,1.0,0.25,0.0,0.666667,0.375,0.571429,0.75,0.571429,0.666667,0.5,0.058824,0.131148,0.2,0.285714,0.785714,0.8,0.25,0.0,0.666667,0.75,1.0,0.75,0.75,0.25,0.833333,0.0,0.666667,0.0,0.60488,0.23126,1.0,1.0,1.0,1.0,0.550351,0.0,0.0,0.514455,0.0,0.0,0.666667,0.0,0.375,0.333333,0.666667,0.333333,1.0,0.0,0.2,0.833333,0.074766,0.666667,0.5,0.426657,0.6,1.0,1.0,0.0,0.060329,0.0,0.0,0.0,0.0,0.0,0.75,0.5,0.0,0.454545,0.75,0.625,0.8,12.139399
952,953,0.382353,0.75,0.388581,0.335012,1.0,1.0,0.0,0.333333,1.0,0.25,0.0,0.666667,0.375,0.571429,0.75,0.285714,0.444444,0.875,0.272059,0.114754,0.2,0.285714,0.285714,0.533333,0.25,0.0,0.333333,0.5,0.4,0.75,0.75,0.75,1.0,0.116938,0.666667,0.0,0.046233,0.125696,1.0,0.75,1.0,1.0,0.315102,0.0,0.0,0.29455,0.0,0.5,0.333333,0.0,0.25,0.333333,0.333333,0.25,1.0,0.0,0.2,0.333333,0.327103,1.0,0.25,0.279267,0.6,1.0,1.0,0.224037,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.272727,0.75,0.625,0.8,11.804849


In [24]:
X_test.head(3)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
892,893,0.0,0.75,0.445638,0.365508,1.0,1.0,0.0,0.333333,1.0,0.25,0.0,0.25,0.375,0.571429,0.75,0.571429,0.555556,0.875,0.316176,0.065574,0.8,0.285714,0.571429,0.533333,0.25,0.0,0.333333,0.75,0.4,0.5,0.75,0.25,1.0,0.11747,0.666667,0.0,0.169521,0.173322,1.0,0.5,1.0,1.0,0.439892,0.0,0.0,0.4112,0.0,0.5,0.333333,0.0,0.375,0.333333,0.333333,0.333333,1.0,0.0,0.2,0.833333,0.401869,0.666667,0.25,0.186178,0.6,1.0,1.0,0.224037,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,0.090909,0.0,0.625,0.8,11.947949
1105,1106,0.235294,0.75,0.57018,0.439121,1.0,1.0,0.333333,0.333333,1.0,0.0,0.0,1.0,0.375,0.571429,0.75,0.857143,0.777778,0.5,0.117647,0.262295,0.2,0.285714,0.571429,0.533333,0.5,0.22625,0.666667,0.75,1.0,1.0,0.75,0.75,1.0,0.182849,0.666667,0.0,0.184503,0.239444,1.0,1.0,1.0,1.0,0.568437,0.543341,0.0,0.728921,0.333333,0.0,0.666667,0.5,0.375,0.333333,0.666667,0.583333,1.0,0.666667,0.6,0.833333,0.149533,0.666667,0.5,0.502116,0.6,1.0,1.0,0.217036,0.058501,0.0,0.0,0.0,0.0,0.0,0.75,0.5,0.0,0.272727,1.0,0.625,0.8,12.69158
413,414,0.058824,0.25,0.363044,0.377814,1.0,0.0,0.0,0.333333,1.0,0.25,0.0,0.208333,0.0,0.571429,0.75,0.571429,0.444444,0.625,0.610294,1.0,0.2,0.285714,0.285714,0.466667,0.25,0.0,0.333333,0.75,0.4,0.5,0.75,0.25,0.833333,0.0,0.666667,0.0,0.431507,0.164975,1.0,0.75,1.0,0.5,0.425446,0.0,0.0,0.397696,0.0,0.0,0.333333,0.0,0.25,0.333333,0.333333,0.25,1.0,0.333333,0.8,0.333333,0.775701,0.333333,0.5,0.253879,0.6,1.0,1.0,0.0,0.0,0.235507,0.0,0.0,0.0,0.0,0.75,0.5,0.0,0.181818,1.0,0.625,0.8,11.652687


#### Save X_train and X_test Dataset.

In [25]:
X_train.to_csv('my_Xtrain.csv', index=False)#'index=False' to drop the index column
X_test.to_csv('my_Xtest.csv', index=False)