Features related to pool
- PoolArea - pool area in square feet
- PoolQC - pool quality

In the baseline model
- removed PoolQC because it has too many missing values.
- imputed PoolArea with the most frequent value => this can be wrong, because a missing value might mean that the house has no pool, which would signal a lower price than houses with a pool.

Inspect
- whether having a pool increases the price, by pairing houses with similar features, one with a pool and one without -> use the distance between feature vectors to find similar houses
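
A rough sketch of that comparison: for each house with a pool, find the closest house without one in standardized feature space and compare the prices. The toy frame and the choice of GrLivArea and OverallQual as matching features are assumptions made for illustration, not the notebook's data or method.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
houses = pd.DataFrame({
    "GrLivArea":   [1500, 1520, 2400, 2380, 1000],
    "OverallQual": [6, 6, 8, 8, 5],
    "PoolArea":    [512, 0, 0, 648, 0],
    "SalePrice":   [235000, 210000, 330000, 345000, 120000],
})
features = ["GrLivArea", "OverallQual"]  # assumed matching features

# Standardize so that no single feature dominates the Euclidean distance
z = (houses[features] - houses[features].mean()) / houses[features].std()

with_pool = houses.index[houses["PoolArea"] > 0]
without_pool = houses.index[houses["PoolArea"] == 0]

rows = []
for i in with_pool:
    # Distance from pool house i to every house without a pool
    diffs = z.loc[without_pool].to_numpy() - z.loc[i].to_numpy()
    dists = np.linalg.norm(diffs, axis=1)
    j = without_pool[int(np.argmin(dists))]
    rows.append({
        "pool_idx": i,
        "match_idx": j,
        "price_diff": houses.loc[i, "SalePrice"] - houses.loc[j, "SalePrice"],
    })

pairs = pd.DataFrame(rows)
print(pairs)
```

With only seven pool houses in train, any such comparison is anecdotal rather than statistical.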

Conclusion
- PoolArea - should not be imputed with the most_frequent strategy; a missing value should be encoded as 0 (no pool).
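
A minimal sketch of the difference between the two strategies, on a made-up series built so that the most frequent value is not 0. In the value counts printed later in this notebook, 0 is already the most frequent PoolArea, so the two strategies would coincide there.

```python
import numpy as np
import pandas as pd

# Toy PoolArea column (values are made up); NaN marks a missing entry
pool_area = pd.Series([512.0, 512.0, np.nan, 0.0, np.nan], name="PoolArea")

# most_frequent strategy: fill missing entries with the mode (512 here)
most_frequent = pool_area.fillna(pool_area.mode().iloc[0])

# constant strategy: treat a missing entry as "no pool"
no_pool = pool_area.fillna(0)

print(most_frequent.tolist())
print(no_pool.tolist())
```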

In [20]:
import pandas as pd
import numpy as np

%matplotlib inline

In [31]:
PATH='../data/'
train = pd.read_csv(PATH + 'train.csv')
test = pd.read_csv(PATH + 'test.csv')

In [17]:
"""
PoolQC - pool quality
    Ex Excellent
    Gd Good
    TA Average/Typical
    Fa Fair
    NA No Pool
"""
print(train['PoolQC'].value_counts(dropna=False))
print()
print(test['PoolQC'].value_counts(dropna=False))

NaN    1453
Gd        3
Ex        2
Fa        2
Name: PoolQC, dtype: int64

NaN    1456
Ex        2
Gd        1
Name: PoolQC, dtype: int64


In [18]:
print(train['PoolArea'].value_counts(dropna=False))
print()
print(test['PoolArea'].value_counts(dropna=False))

0      1453
738       1
648       1
576       1
555       1
519       1
512       1
480       1
Name: PoolArea, dtype: int64

0      1453
800       1
561       1
444       1
368       1
228       1
144       1
Name: PoolArea, dtype: int64


### Adding feature HavingPool (0: no pool, 1: has pool)

Looking at the unique value counts for PoolArea, it might be useful to add a HavingPool feature so that the model can tell more easily whether a house has a pool.
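
A minimal sketch of such a flag, assuming a pool is present exactly when PoolArea is greater than 0 (the small frame is illustrative):

```python
import pandas as pd

# Illustrative frame; in the notebook this would be train or test
df = pd.DataFrame({"PoolArea": [0, 512, 0, 648]})

# 1 if the house has a pool, 0 otherwise
df["HavingPool"] = (df["PoolArea"] > 0).astype(int)
print(df)
```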

In train, 7 houses have a pool, which matches the 7 non-missing PoolQC (quality) values => PoolQC should definitely be kept so that the model can see pool quality.

#### PoolQC
- the quality levels are ordered -> an ordinal encoder can be used to encode this feature. For example:
    - NA - 0
    - Fa - 1
    - TA - 2
    - Gd - 3
    - Ex - 4
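
One way to get exactly this numbering is an ordered pandas Categorical. This is a sketch on a made-up series; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

quality_order = ["Fa", "TA", "Gd", "Ex"]  # worst to best

# Made-up PoolQC values; NaN stands for "no pool"
poolqc = pd.Series(["Ex", np.nan, "Gd", "Fa", np.nan, "TA"], name="PoolQC")

cat = pd.Categorical(poolqc, categories=quality_order, ordered=True)

# .codes is -1 for missing and 0..3 for the categories,
# so adding 1 gives NA=0, Fa=1, TA=2, Gd=3, Ex=4
encoded = pd.Series(cat.codes + 1, index=poolqc.index, name="PoolQC")
print(encoded.tolist())
```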

In [12]:
# get houses with pool and their prices
cols = ['Id', 'PoolArea', 'PoolQC', 'SalePrice']
pool = train[cols] 
pool = pool[pool.PoolArea != 0] # get only rows where house has pool
pool.head(10)

        Id  PoolArea PoolQC  SalePrice
197    198       512     Ex     235000
810    811       648     Fa     181000
1170  1171       576     Gd     171000
1182  1183       555     Ex     745000
1298  1299       480     Gd     160000
1386  1387       519     Fa     250000
1423  1424       738     Gd     274970


In [13]:
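import category_encoders as ce

# Note: without an explicit mapping, the integers follow the order of first
# appearance (Ex=1, Fa=2, Gd=3 here), not the quality ranking.
encoder = ce.OrdinalEncoder(cols=['PoolQC'])
encoder.fit_transform(pool)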

        Id  PoolArea  PoolQC  SalePrice
197    198       512       1     235000
810    811       648       2     181000
1170  1171       576       3     171000
1182  1183       555       1     745000
1298  1299       480       3     160000
1386  1387       519       2     250000
1423  1424       738       3     274970


In [23]:
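# Explicit quality-ordered mapping; a missing PoolQC means no pool (0).
# TA does not occur in train or test, so it is not mapped here.
replace_poolqc = {"PoolQC": {"Ex": 3, "Gd": 2, "Fa": 1, np.nan: 0}}
train.replace(replace_poolqc)['PoolQC'].value_counts()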

0    1453
2       3
3       2
1       2
Name: PoolQC, dtype: int64

### Transformer

In [38]:
# write a transformer to encode PoolQC
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class PoolQCTransformer(BaseEstimator, TransformerMixin):
    """
    Map PoolQC quality labels to ordinal integers (missing = 0, no pool).
    """
    def fit(self, X, y=None):
        # nothing to learn: the mapping is fixed
        return self

    def transform(self, X):
        map_poolqc = {"PoolQC": {"Ex": 3, "Gd": 2, "Fa": 1, np.nan: 0}}
        return X.replace(map_poolqc)

# test PoolQCTransformer
pipe = Pipeline([
    ('PoolQCEncoding', PoolQCTransformer())
])
# fit first: calling transform on an unfitted pipeline is not reliable
# across scikit-learn versions
train_transformed = pipe.fit_transform(train)

print("train")
print(train['PoolQC'].value_counts(dropna=False))
print("\ntrain_transformed")
print(train_transformed['PoolQC'].value_counts(dropna=False))

train
NaN    1453
Gd        3
Ex        2
Fa        2
Name: PoolQC, dtype: int64

train_transformed
0    1453
2       3
3       2
1       2
Name: PoolQC, dtype: int64


In [11]:
# write a transformer to encode PoolArea
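
This cell is left empty in the notebook. A possible sketch, following the conclusion that a missing PoolArea should become 0, is shown below. The class name `PoolAreaTransformer` and the added `HavingPool` column are assumptions for illustration, not the author's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class PoolAreaTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: fill missing PoolArea with 0 and add a HavingPool flag."""

    def fit(self, X, y=None):
        # nothing to learn: the fill value is fixed
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        X["PoolArea"] = X["PoolArea"].fillna(0)
        X["HavingPool"] = (X["PoolArea"] > 0).astype(int)
        return X

# Small made-up frame to exercise the transformer
demo = pd.DataFrame({"PoolArea": [0, 512, np.nan]})
out = PoolAreaTransformer().fit_transform(demo)
print(out)
```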

In [39]:
import category_encoders as ce