### Notebook Summary

**A first look at data**
* Read the CSV data into a pandas DataFrame
* Get the column names
* Check the data types of the columns
* Separate the numerical and non-numerical columns
* Check for missing values (the number of missing values in each column and their percentage)
* Explore questions such as:
    - Which types of houses have certain missing features?
    - Are sale prices different for houses with certain missing features?

### Imports

In [4]:
import pandas as pd

### Directory

In [5]:
BASE_DIR = "../AmesHousing/"
DATA_IN = BASE_DIR+"DataDwn/"

### Training Data

In [12]:
trn = pd.read_csv(DATA_IN+"train.csv")
trn.shape
print("Training data:", trn.shape[0], "examples and", trn.shape[1], "features.")

Training data: 1460 examples and 81 features.


**Column names**

In [7]:
feature_names = trn.columns.values
feature_names

array(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir',
       'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea',
       'GarageQual', 'GarageCond', 'Pav

**Data types of the features**

In [8]:
trn.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
Alley             object
LotShape          object
LandContour       object
Utilities         object
LotConfig         object
LandSlope         object
Neighborhood      object
Condition1        object
Condition2        object
BldgType          object
HouseStyle        object
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
RoofStyle         object
RoofMatl          object
Exterior1st       object
Exterior2nd       object
MasVnrType        object
MasVnrArea       float64
ExterQual         object
ExterCond         object
Foundation        object
                  ...   
BedroomAbvGr       int64
KitchenAbvGr       int64
KitchenQual       object
TotRmsAbvGrd       int64
Functional        object
Fireplaces         int64
FireplaceQu       object
GarageType        object
GarageYrBlt      float64


**Numeric features**

In [9]:
trn_num = trn.select_dtypes(include='number')  # numeric (int and float) columns, via the public API
trn_num.shape

(1460, 38)

In [10]:
trn_num.dtypes

Id                 int64
MSSubClass         int64
LotFrontage      float64
LotArea            int64
OverallQual        int64
OverallCond        int64
YearBuilt          int64
YearRemodAdd       int64
MasVnrArea       float64
BsmtFinSF1         int64
BsmtFinSF2         int64
BsmtUnfSF          int64
TotalBsmtSF        int64
1stFlrSF           int64
2ndFlrSF           int64
LowQualFinSF       int64
GrLivArea          int64
BsmtFullBath       int64
BsmtHalfBath       int64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
TotRmsAbvGrd       int64
Fireplaces         int64
GarageYrBlt      float64
GarageCars         int64
GarageArea         int64
WoodDeckSF         int64
OpenPorchSF        int64
EnclosedPorch      int64
3SsnPorch          int64
ScreenPorch        int64
PoolArea           int64
MiscVal            int64
MoSold             int64
YrSold             int64
SalePrice          int64
dtype: object

**Object features**

In [73]:
trn_obj = trn.select_dtypes(include=['object'])
trn_obj.shape

(1460, 43)
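
As a quick sanity check, the numeric and non-numeric selections should partition the columns (here 38 + 43 = 81). The sketch below illustrates that check on a small made-up frame; `toy` and its values are hypothetical stand-ins for `trn`, not rows from the data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `trn`; values are made up.
toy = pd.DataFrame({
    "LotArea": [1000, 2000, 3000],
    "LotFrontage": [50.0, None, 70.0],
    "Street": ["Pave", "Pave", "Grvl"],
    "CentralAir": ["Y", "N", "Y"],
})

# Numeric columns (int and float) versus everything else.
num_cols = toy.select_dtypes(include="number").columns
obj_cols = toy.select_dtypes(exclude="number").columns

# Every column should land in exactly one of the two groups.
assert len(num_cols) + len(obj_cols) == toy.shape[1]
print(list(num_cols), list(obj_cols))
```

Using `exclude="number"` for the second group, rather than naming the `object` dtype, keeps the two selections complementary by construction.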

In [74]:
trn_obj.dtypes

MSZoning         object
Street           object
Alley            object
LotShape         object
LandContour      object
Utilities        object
LotConfig        object
LandSlope        object
Neighborhood     object
Condition1       object
Condition2       object
BldgType         object
HouseStyle       object
RoofStyle        object
RoofMatl         object
Exterior1st      object
Exterior2nd      object
MasVnrType       object
ExterQual        object
ExterCond        object
Foundation       object
BsmtQual         object
BsmtCond         object
BsmtExposure     object
BsmtFinType1     object
BsmtFinType2     object
Heating          object
HeatingQC        object
CentralAir       object
Electrical       object
KitchenQual      object
Functional       object
FireplaceQu      object
GarageType       object
GarageFinish     object
GarageQual       object
GarageCond       object
PavedDrive       object
PoolQC           object
Fence            object
MiscFeature      object
SaleType        

**Note:** Features such as `Id`, `MSSubClass`, `OverallQual`, and `OverallCond` have a numeric type, but they are not measurements. `Id` is a row identifier and `MSSubClass` is a code for the dwelling class. `OverallQual` and `OverallCond` are categorical (rating) variables.
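
One possible way to act on this note is to cast such columns to a categorical dtype so they are no longer picked up as numeric measurements. This is a minimal sketch on a hypothetical toy frame; the 1 to 10 rating scale for `OverallQual` is an assumption about the data dictionary, and the values are made up.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "MSSubClass": [60, 20, 60, 70],
    "OverallQual": [7, 6, 7, 8],
    "LotArea": [1000, 2000, 3000, 4000],
})

# Dwelling-class code: treat as an unordered category.
toy["MSSubClass"] = toy["MSSubClass"].astype("category")

# Quality rating: ordered category (assumed scale of 1 to 10).
rating_dtype = pd.CategoricalDtype(categories=list(range(1, 11)), ordered=True)
toy["OverallQual"] = toy["OverallQual"].astype(rating_dtype)

# Remaining numeric columns, minus the row identifier.
measurements = toy.select_dtypes(include="number").drop(columns=["Id"])
print(list(measurements.columns))
```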

**Missing values?**

Total count and fraction (labeled `Percent`) of missing values per column

In [75]:
tot_missing = trn.isnull().sum().sort_values(ascending=False)
percent_missing = (trn.isnull().sum()/trn.isnull().count()).sort_values(ascending=False)
missing_num = pd.concat([tot_missing, percent_missing], axis=1, keys=['Total', 'Percent'])
missing_num.head(20)

| | Total | Percent |
|---|---|---|
| PoolQC | 1453 | 0.995205 |
| MiscFeature | 1406 | 0.963014 |
| Alley | 1369 | 0.937671 |
| Fence | 1179 | 0.807534 |
| FireplaceQu | 690 | 0.472603 |
| LotFrontage | 259 | 0.177397 |
| GarageCond | 81 | 0.055479 |
| GarageType | 81 | 0.055479 |
| GarageYrBlt | 81 | 0.055479 |
| GarageFinish | 81 | 0.055479 |


**Note:** Most houses don't have a pool, alley access, or a fence, and nearly half have no fireplace. About 5% of the houses don't have a garage, and about 2% don't have a basement.
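
The same kind of summary can be built a little more compactly, because the mean of a boolean mask is the fraction of `True` values. The sketch below shows this on a hypothetical toy frame with made-up values; the column label `Fraction` is my own choice, not the notebook's.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "PoolQC": [None, None, None, "Gd"],
    "Fence": [None, "MnPrv", None, "GdWo"],
    "LotArea": [1000, 2000, 3000, 4000],
})

is_missing = toy.isnull()
summary = pd.DataFrame({
    "Total": is_missing.sum(),      # count of missing values per column
    "Fraction": is_missing.mean(),  # share of rows that are missing
})

# Keep only columns that actually have missing values, largest first.
summary = summary[summary["Total"] > 0].sort_values("Total", ascending=False)
print(summary)
```

Filtering out complete columns keeps the table focused on the features that need a decision.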

**About garage**
* Types of houses that have missing garage information.
* Sale prices of houses with and without a garage.

In [13]:
null_garage = trn[trn['GarageYrBlt'].isnull()]
null_garage[['BldgType','HouseStyle','GarageCars', 'GarageArea', 'SalePrice']]

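| | BldgType | HouseStyle | GarageCars | GarageArea | SalePrice |
|---|---|---|---|---|---|
| 39 | Duplex | 1Story | 0 | 0 | 82000 |
| 48 | 2fmCon | 2Story | 0 | 0 | 113000 |
| 78 | Duplex | 1Story | 0 | 0 | 136500 |
| 88 | 1Fam | 1.5Fin | 0 | 0 | 85000 |
| 89 | 1Fam | 1Story | 0 | 0 | 123600 |
| 99 | 1Fam | 1Story | 0 | 0 | 128950 |
| 108 | 1Fam | 1.5Fin | 0 | 0 | 115000 |
| 125 | 2fmCon | 1.5Fin | 0 | 0 | 84500 |
| 127 | 1Fam | 1.5Unf | 0 | 0 | 87000 |
| 140 | 1Fam | 1Story | 0 | 0 | 115000 |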


**Note:** Houses without garages span a mix of building types and house styles.
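
A cross-tabulation is one way to make this "mix" concrete. The sketch below uses only the ten rows displayed above, so the counts describe that excerpt rather than all 81 houses without a garage; the names `shown` and `mix` are my own.

```python
import pandas as pd

# The ten rows displayed above (an excerpt, not the full subset).
shown = pd.DataFrame({
    "BldgType": ["Duplex", "2fmCon", "Duplex", "1Fam", "1Fam",
                 "1Fam", "1Fam", "2fmCon", "1Fam", "1Fam"],
    "HouseStyle": ["1Story", "2Story", "1Story", "1.5Fin", "1Story",
                   "1Story", "1.5Fin", "1.5Fin", "1.5Unf", "1Story"],
})

# Count houses for each (building type, house style) pair.
mix = pd.crosstab(shown["BldgType"], shown["HouseStyle"])
print(mix)
```

On the full data, the same call could be applied to `null_garage["BldgType"]` and `null_garage["HouseStyle"]`.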

**Price of houses with and without garages**

In [15]:
null_garage['SalePrice'].describe()

count        81.000000
mean     103317.283951
std       32815.023389
min       34900.000000
25%       82500.000000
50%      100000.000000
75%      124000.000000
max      200500.000000
Name: SalePrice, dtype: float64

In [16]:
notnull_garage = trn[trn['GarageYrBlt'].notnull()]
notnull_garage['SalePrice'].describe()

count      1379.00000
mean     185479.51124
std       79023.89060
min       35311.00000
25%      134000.00000
50%      167500.00000
75%      217750.00000
max      755000.00000
Name: SalePrice, dtype: float64

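**Note:** Houses with garages have a higher mean sale price (about 185,480 vs. 103,317) and a higher median (167,500 vs. 100,000), but the minimum sale prices are similar (35,311 vs. 34,900).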
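
The two `describe()` calls can also be folded into a single grouped summary, which puts both groups side by side. This is a minimal sketch on a hypothetical toy frame with made-up prices; the group labels and the name `by_garage` are my own.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "GarageYrBlt": [2003.0, None, 1976.0, None, 1998.0],
    "SalePrice": [200000, 90000, 180000, 110000, 220000],
})

# Label each house by whether a garage year is recorded.
group = toy["GarageYrBlt"].notnull().map({True: "garage", False: "no garage"})

# One row per group, one column per statistic.
by_garage = toy.groupby(group)["SalePrice"].agg(["count", "mean", "median", "min"])
print(by_garage)
```

As in the cells above, a missing `GarageYrBlt` is used as the indicator that a house has no garage.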