### Notebook Summary

* Read the CSV data into a pandas DataFrame
* Get the numeric column names
* Select the numeric columns that are measurements
* Obtain a statistical summary of the selected columns
    - mean
    - median
    - standard deviation
    - skewness
    - kurtosis
    - number of outliers, based on z-scores
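
As a rough illustration of the last item, the sketch below counts values whose z-score has a magnitude of at least 3, which is the rule applied in the statistical summary cell. It uses only NumPy and a synthetic sample; the helper name `count_zscore_outliers` and its `threshold` parameter are hypothetical and not part of the notebook.

```python
import numpy as np

def count_zscore_outliers(values, threshold=3.0):
    """Count values whose z-score magnitude is at least `threshold`."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]               # drop missing values first
    # Population standard deviation (ddof=0), the default of scipy's zscore
    z = (x - x.mean()) / x.std()
    return int(np.sum(np.abs(z) >= threshold))

# Synthetic sample: twenty typical values and one extreme value
sample = [10.0] * 10 + [12.0] * 10 + [100.0]
print(count_zscore_outliers(sample))
```

Only the extreme value lies more than three standard deviations from the mean here, so the count is 1.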

### Imports

In [1]:
import pandas as pd
import numpy as np
from scipy.stats.mstats import zscore

### Directory

In [2]:
BASE_DIR = "../AmesHousing/"
DATA_IN = BASE_DIR+"DataDwn/"

### Training Data

In [3]:
trn = pd.read_csv(DATA_IN+"train.csv")
trn.shape

(1460, 81)

**Numeric column names**

In [4]:
# Public equivalent of the private helper trn._get_numeric_data()
trn_num = trn.select_dtypes(include="number")
trn_num.columns.values

array(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF',
       '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'], dtype=object)

**Feature names that (apparently) indicate measurements**

In [5]:
feature_names = ['LotFrontage', 'LotArea', 'MasVnrArea',
       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 
       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 
       'BsmtFullBath','BsmtHalfBath', 'FullBath', 'HalfBath', 
       'BedroomAbvGr','KitchenAbvGr', 'TotRmsAbvGrd', 
       'Fireplaces',
       'GarageCars', 'GarageArea', 
       'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 
       'SalePrice']
print(len(feature_names), "features selected")

28 features selected
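
Because the list above is typed by hand, a quick check that every name exists and is numeric can catch typos early. The following is a minimal sketch on a made-up toy frame; the helper `check_features` is hypothetical and not part of the notebook.

```python
import pandas as pd

def check_features(df, names):
    """Return (missing, non_numeric) lists for the requested column names."""
    missing = [n for n in names if n not in df.columns]
    numeric = set(df.select_dtypes(include="number").columns)
    non_numeric = [n for n in names if n in df.columns and n not in numeric]
    return missing, non_numeric

# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "LotArea": [1000, 2000, 3000],
    "GarageCars": [1, 2, 3],
    "Street": ["Pave", "Pave", "Grvl"],
})
missing, non_numeric = check_features(toy, ["LotArea", "Street", "PoolArea"])
print(missing)      # ['PoolArea']
print(non_numeric)  # ['Street']
```

On the real data, the same call would be `check_features(trn, feature_names)`, and both returned lists should be empty.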


### Statistical summary

In [6]:
# Public equivalent of the private helper trn._get_numeric_data()
trn_num = trn.select_dtypes(include="number")
trn_num.shape

(1460, 38)
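
The loop in the next cell prints one block of statistics per feature. Where a single table is easier to compare, the same quantities could be collected into a DataFrame. The sketch below assumes a hypothetical helper `summarize` and runs on synthetic data, not on the training set.

```python
import numpy as np
import pandas as pd

def summarize(df, columns, threshold=3.0):
    """Build one row of summary statistics per column."""
    rows = {}
    for col in columns:
        s = df[col].dropna()
        # z-scores with the population standard deviation (ddof=0)
        z = (s - s.mean()) / s.std(ddof=0)
        rows[col] = {
            "count": int(s.count()),
            "mean": s.mean(),
            "median": s.median(),
            "std": s.std(),          # sample standard deviation, as in describe()
            "skewness": s.skew(),
            "kurtosis": s.kurt(),
            "outliers": int((z.abs() >= threshold).sum()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Synthetic data: "x" has one extreme value, "y" has one missing value
toy = pd.DataFrame({
    "x": [10.0] * 10 + [12.0] * 10 + [100.0],
    "y": [float(i) for i in range(20)] + [np.nan],
})
summary = summarize(toy, ["x", "y"])
print(summary)
```

With the notebook's variables, the call would be `summarize(trn_num, feature_names)`.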

In [7]:
for feature in feature_names:
    print(feature)
    print(trn_num[feature].describe())
    # pandas reports bias-corrected sample skewness and excess kurtosis
    skewness = trn_num[feature].skew()
    kurtosis = trn_num[feature].kurt()
    # Drop missing values before computing z-scores
    f = trn_num[feature].dropna().to_numpy()
    z = zscore(f)
    # Count observations at least 3 standard deviations from the mean
    outliers = int(np.sum(np.abs(z) >= 3))

    print("Mean: ",  np.mean(f))
    print("Median: ",  np.median(f))
    print("Skewness:", skewness)
    print("Kurtosis:", kurtosis)
    print("Outliers:", outliers)
    print(" ")

LotFrontage
count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
max       313.000000
Name: LotFrontage, dtype: float64
Mean:  70.049958368
Median:  69.0
Skewness: 2.16356914232
Kurtosis: 17.4528672598
Outliers: 12
 
LotArea
count      1460.000000
mean      10516.828082
std        9981.264932
min        1300.000000
25%        7553.500000
50%        9478.500000
75%       11601.500000
max      215245.000000
Name: LotArea, dtype: float64
Mean:  10516.8280822
Median:  9478.5
Skewness: 12.2076878512
Kurtosis: 203.243271019
Outliers: 13
 
MasVnrArea
count    1452.000000
mean      103.685262
std       181.066207
min         0.000000
25%         0.000000
50%         0.000000
75%       166.000000
max      1600.000000
Name: MasVnrArea, dtype: float64
Mean:  103.685261708
Median:  0.0
Skewness: 2.66908421018
Kurtosis: 10.0824173174
Outliers: 32
 
BsmtFinSF1
count    1460.000000
mean      443.639726
std  