### Notebook Summary

Explore univariate correlation:
- How strong are the linear correlations (Pearson R) between the sale price and the numerical measurement features?
- Rank the features by R.

### Imports

In [1]:
import pandas as pd
import numpy as np

### Directory

In [2]:
BASE_DIR = "../AmesHousing/"
DATA_IN = BASE_DIR+"DataDwn/"

### Training Data

In [3]:
trn = pd.read_csv(DATA_IN+"train.csv")
trn.shape

(1460, 81)

**Numeric column names**

In [4]:
# select_dtypes is the public API; it avoids the private _get_numeric_data() helper
trn_num = trn.select_dtypes(include="number")
trn_num.columns.values

array(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
       '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'], dtype=object)

**Feature names that (apparently) indicate measurements**

In [5]:
feature_names = ['LotFrontage', 'LotArea', 'MasVnrArea',
       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 
       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 
       'BsmtFullBath','BsmtHalfBath', 'FullBath', 'HalfBath', 
       'BedroomAbvGr','KitchenAbvGr', 'TotRmsAbvGrd', 
       'Fireplaces',
       'GarageCars', 'GarageArea', 
       'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 
       'SalePrice']
print(len(feature_names), "features selected")

28 features selected


### Correlation
Pearson correlation coefficient R
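As a reminder of what R measures, here is a minimal sketch that computes the Pearson coefficient from its definition (the sum of products of centered values, divided by the square root of the product of the summed squared deviations) and compares it with `np.corrcoef`. The helper name `pearson_r` and the toy numbers are assumptions made for illustration; they are not taken from `train.csv`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center both variables on their means.
    dx = x - x.mean()
    dy = y - y.mean()
    # Sum of cross products over the geometric mean of the squared deviations.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Toy values, made up for illustration (hypothetical living areas and prices).
area = [850, 1200, 1500, 1710, 2200]
price = [100_000, 150_000, 170_000, 208_500, 250_000]

r_manual = pearson_r(area, price)
r_numpy = float(np.corrcoef(area, price)[0, 1])
print(round(r_manual, 4), round(r_numpy, 4))
```

R ranges from -1 (perfect decreasing linear relation) to +1 (perfect increasing linear relation); it only captures linear association.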

In [6]:
trn_select = trn[feature_names]
trn_select.shape

(1460, 28)

In [7]:
corr = trn_select.corr().sort_values(by="SalePrice",ascending=False)
corr.head(3)

| | LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | ... | Fireplaces | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SalePrice | 0.351799 | 0.263843 | 0.477493 | 0.38642 | -0.011378 | 0.214479 | 0.613581 | 0.605852 | 0.319334 | -0.025606 | ... | 0.466929 | 0.640409 | 0.623431 | 0.324413 | 0.315856 | -0.128578 | 0.044584 | 0.111447 | 0.092404 | 1.0 |
| GrLivArea | 0.402797 | 0.263116 | 0.390857 | 0.208171 | -0.00964 | 0.240257 | 0.454868 | 0.566024 | 0.687501 | 0.134683 | ... | 0.461679 | 0.467247 | 0.468997 | 0.247433 | 0.330224 | 0.009113 | 0.020643 | 0.10151 | 0.170205 | 0.708624 |
| GarageCars | 0.285691 | 0.154871 | 0.364204 | 0.224054 | -0.038264 | 0.214175 | 0.434585 | 0.439317 | 0.183926 | -0.09448 | ... | 0.300789 | 1.0 | 0.882475 | 0.226342 | 0.213569 | -0.151434 | 0.035765 | 0.050494 | 0.020934 | 0.640409 |


In [8]:
corr["SalePrice"]

SalePrice        1.000000
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
ScreenPorch      0.111447
PoolArea         0.092404
3SsnPorch        0.044584
BsmtFinSF2      -0.011378
BsmtHalfBath    -0.016844
LowQualFinSF    -0.025606
EnclosedPorch   -0.128578
KitchenAbvGr    -0.135907
Name: SalePrice, dtype: float64
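The listing above is sorted by signed R, so features with negative coefficients end up at the bottom even when their magnitude is larger than that of weakly positive features. A minimal sketch of ranking by |R| instead, with the target's self-correlation dropped, could look like this. The `toy` frame is a made-up stand-in for `trn_select`, and the names `r` and `ranked` are illustrative.

```python
import pandas as pd

# Toy stand-in for trn_select (values are made up for illustration).
toy = pd.DataFrame({
    "GrLivArea":    [850, 1200, 1500, 1710, 2200, 1300],
    "KitchenAbvGr": [2, 1, 1, 1, 1, 2],
    "PoolArea":     [0, 0, 0, 0, 500, 0],
    "SalePrice":    [100_000, 150_000, 170_000, 208_500, 250_000, 120_000],
})

# Correlation of every feature with the target, minus the target's self-correlation.
r = toy.corr()["SalePrice"].drop("SalePrice")

# Order by magnitude so negative correlations are ranked by strength, not by sign.
ranked = r.reindex(r.abs().sort_values(ascending=False).index)
print(ranked)
```

On the real data the same two lines would be applied to `trn_select` in place of `toy`.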

**Note:** As expected, floor area (GrLivArea, TotalBsmtSF, 1stFlrSF), garage size and capacity (GarageArea, GarageCars), and the number of full bathrooms (FullBath) have the strongest linear correlations with the sale price (R between 0.56 and 0.71).
The number of fireplaces is probably an indication of how big the house is, hence the noticeable correlation (R = 0.47).
MasVnrArea (masonry veneer area in square feet) shows a moderate correlation (R = 0.48). It is not clear which part of the house it refers to, or whether this finish is already included in other SF measurements.
Deck, porch, and pool features do not apply to most houses, but they do contribute to the size of the house.
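One way to check the remark that deck, porch, and pool features do not apply to most houses is to look at the share of rows in which each feature is zero. The sketch below uses a made-up `toy` frame rather than the real training data, and `zero_share` is an illustrative name; a feature that is zero for most rows can only have a limited linear correlation with the price across the whole sample.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
toy = pd.DataFrame({
    "GrLivArea":  [850, 1200, 1500, 1710, 2200, 1300],
    "WoodDeckSF": [0, 120, 0, 0, 200, 0],
    "PoolArea":   [0, 0, 0, 0, 500, 0],
    "SalePrice":  [100_000, 150_000, 170_000, 208_500, 250_000, 120_000],
})

# Fraction of houses for which each feature is exactly zero (feature absent).
features = toy.drop(columns="SalePrice")
zero_share = (features == 0).mean().sort_values(ascending=False)
print(zero_share)
```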