## 3. Feature selection and EDA


#### Distribution of target variable
House prices tend to be right-skewed rather than normally distributed, so we take the natural logarithm of the sale price to obtain a distribution that is closer to Gaussian.


In [None]:
df['SalePrice_log'] = np.log(df['SalePrice'])

df_salePrice = df[['SalePrice', 'SalePrice_log']]

# Create one figure with two side-by-side axes (1 row, 2 columns)
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(20, 6))

# Subplot 1: histogram of the raw sale price
df_salePrice['SalePrice'].plot(kind='hist', color='blue', ax=ax0)
ax0.set_title('Histogram of Sale Price Distribution')
ax0.set_xlabel('Sale Price')
ax0.set_ylabel('Number of Sales')

# Subplot 2: histogram of the log sale price
df_salePrice['SalePrice_log'].plot(kind='hist', ax=ax1)
ax1.set_title('Histogram of Log Sale Price Distribution')
ax1.set_xlabel('Log Sale Price')
ax1.set_ylabel('Number of Sales')

plt.show()
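
The effect of the log transform can also be quantified rather than only judged by eye. The following is a minimal sketch that compares sample skewness before and after the transform. It uses synthetic log-normal "prices" as a stand-in, so the data and the variable names `prices`, `skew_raw` and `skew_log` are assumptions and not part of the Ames analysis.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (hypothetical data, not the Ames set)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000), name="SalePrice")

skew_raw = prices.skew()          # sample skewness of the raw values
skew_log = np.log(prices).skew()  # sample skewness after the log transform

print(f"skewness raw: {skew_raw:.2f}, log: {skew_log:.2f}")
```

A skewness close to zero after the transform supports using the log price as the target; on the real data the same two calls could be applied to `df['SalePrice']`.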

Quick notes and reduction:
Since 1stFlrSF + 2ndFlrSF is (nearly) equal to GrLivArea, which we check through their correlation below, we drop 1stFlrSF and 2ndFlrSF.


In [None]:
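# Correlation between the summed floor areas and GrLivArea
# (printed explicitly, since only the last expression of a cell is displayed)
print(df[['1stFlrSF', '2ndFlrSF']].sum(axis = 1).corr(df['GrLivArea']).round(2))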

df.drop(['1stFlrSF', '2ndFlrSF'], axis = 1, inplace = True)
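
A high correlation does not by itself show that two columns agree row by row, so a direct comparison can complement the check above. The sketch below runs on a small toy frame with made-up values (the third row deliberately breaks the identity); the names `toy`, `floor_sum` and `mismatch` are illustrative only.

```python
import pandas as pd

# Toy frame with hypothetical values; the third row deliberately breaks the identity
toy = pd.DataFrame({
    "1stFlrSF": [800, 1000, 900],
    "2ndFlrSF": [700, 0, 400],
    "GrLivArea": [1500, 1000, 1380],
})

floor_sum = toy["1stFlrSF"] + toy["2ndFlrSF"]
mismatch = toy[floor_sum != toy["GrLivArea"]]  # rows where the sum differs

print(f"{len(mismatch)} of {len(toy)} rows differ")
```

On the real data this would have to be run before the two columns are dropped.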

The feature data from the Ames data set can be split into 10 categories: basement, bath, garage, kitchen, pool, fireplace, heating, masonry, porch, and finally miscellaneous.

### 3.1 Basement

##### Categorical features 

We start by constructing dictionaries that map the string categories to ordinal numbers, so the features can be used as numerical inputs.

In [None]:
rank_dict = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'None': 1, 'Po':0}
BsmtFin_dict = {'GLQ':5, 'ALQ':4, 'BLQ':3, 'Rec':2, 'LwQ':1, 'Unf':0, 'None':0}
BsmtEx_dict = {'Gd':3, 'Av':2, 'Mn':1, 'No':0, 'None':0}

#  BsmtFinType1 & 2: Rating of basement finished area 1 and 2 (if multiple types)
df['BsmtFinType1'] = df.BsmtFinType1.replace(BsmtFin_dict)
df['BsmtFinType2'] = df.BsmtFinType2.replace(BsmtFin_dict)

# BsmtExposure: Refers to walkout or garden level walls
df['BsmtExposure'] = df.BsmtExposure.replace(BsmtEx_dict)

# BsmtCond: Condition of the basement
df['BsmtCond'] = df.BsmtCond.replace(rank_dict)

# BsmtQual: Evaluates the height of the basement
df['BsmtQual'] = df.BsmtQual.replace(rank_dict)

bsmt_cat_corr = df[['SalePrice_log','BsmtFinType1', 'BsmtFinType2', 'BsmtExposure', 'BsmtCond', 'BsmtQual']].corr()
# display() is needed because this is not the last expression in the cell
display(bsmt_cat_corr.style.background_gradient(cmap='coolwarm'))

bsmt_cat = ['BsmtFinType1', 'BsmtFinType2', 'BsmtExposure', 'BsmtCond', 'BsmtQual']

for bsmt in bsmt_cat:
    df.boxplot(column = 'SalePrice_log', by = bsmt)
    plt.suptitle('')
    plt.show()
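
The dictionary-based encoding above can also be written with `Series.map`, which makes unmapped labels easy to detect. The sketch below assumes that missing basement values have been, or can be, filled with the string `'None'`; the small `qual` series is hypothetical example data.

```python
import numpy as np
import pandas as pd

# Hypothetical column; NaN stands for "no basement"
qual = pd.Series(["Gd", "TA", np.nan, "Ex", "Fa"], name="BsmtQual")

rank_dict = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "None": 1, "Po": 0}

# Make the missing category explicit, then map labels to integers.
# Unlike replace, map turns any label absent from the dictionary into NaN,
# so unexpected categories show up in the isna() count below.
encoded = qual.fillna("None").map(rank_dict)

print(encoded.tolist())
print("unmapped labels:", int(encoded.isna().sum()))
```

With `replace`, an unexpected label would pass through unchanged as a string; with `map` it becomes NaN and is counted.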

While BsmtFinType1 and BsmtFinType2 don't appear to add much predictive power, they are connected with the BsmtFinSF1 and BsmtFinSF2 variables. We therefore create two new interaction variables.


In [None]:
df['BsmtType1Inter'] = df['BsmtFinType1'] * df['BsmtFinSF1']
df['BsmtType2Inter'] = df['BsmtFinType2'] * df['BsmtFinSF2']

##### Numerical features 

In [None]:
temp_df = df[['SalePrice_log','BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath','BsmtType1Inter', 'BsmtType2Inter']]

correl = temp_df.corr()
# display() is needed because this is not the last expression in the cell
display(correl.style.background_gradient(cmap='coolwarm'))

sns.pairplot(temp_df)

#### Basement summary:

##### Categorical:
From the boxplots above it is evident that BsmtCond and BsmtQual have clear predictive power, while BsmtExposure displays some evidence of predictive power. However, BsmtFinType1 and BsmtFinType2 show no predictive power.

##### Numerical:
From the scatter plots and heat map, we observe very little explanatory power from BsmtUnfSF, BsmtFinSF2, BsmtFullBath and BsmtHalfBath. To save running time and reduce the risk of overfitting, we remove them. Interacting BsmtFinType1 and BsmtFinType2 with BsmtFinSF1 and BsmtFinSF2 did not noticeably improve their predictive performance, so we drop the interaction terms together with the two type features. BsmtFinSF1 is also removed because of its strong correlation with TotalBsmtSF.


In [None]:
df.drop(['BsmtFinType1', 'BsmtFinType2', 'BsmtFinSF1', 'BsmtFinSF2', 
         'BsmtUnfSF', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtType1Inter', 'BsmtType2Inter'], axis = 1, inplace = True)
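
The decision to drop one of two strongly correlated features can be supported by listing such pairs programmatically instead of reading them off the heat map. The following is a sketch only: the helper `high_corr_pairs` and the 0.8 threshold are assumptions, and the data is synthetic (the column names are borrowed for readability, the values are random).

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, corr) for column pairs with |corr| above threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# Synthetic data: the second column is nearly a copy of the first
rng = np.random.default_rng(1)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "TotalBsmtSF": base,
    "BsmtFinSF1": base + rng.normal(scale=0.1, size=500),
    "BsmtUnfSF": rng.normal(size=500),  # independent noise
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Which member of a flagged pair to keep remains a judgment call; the helper only surfaces the candidates.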

### 3.2 Garage

#### Categorical variables for Garage

In [None]:
rank_dict = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'None': 1, 'Po':0}
GarFin_dict = {'Fin':2, 'RFn':1, 'Unf':0}
GarType_dict = {'2Types': '2Types', 'Attchd': 'BuiltIn', 'Basment': 'BuiltIn', 'CarPort': 'Separate', 
               'Detchd': 'Separate'}

# Garage Quality and Condition
df['GarageQual'] = df.GarageQual.replace(rank_dict)
df['GarageCond'] = df.GarageCond.replace(rank_dict)
df['GarageFinish'] = df.GarageFinish.replace(GarFin_dict)

# Since type of garage is either built in the house or separate from the house, we can categorize some of them together:
df['GarageType'] = df['GarageType'].replace(GarType_dict)
df.loc[df['GarageType'].isna(), 'GarageType'] = 'NoGarage'

# Boxplots of log sale price for each categorical garage feature
g_cat = ['GarageType', 'GarageQual', 'GarageCond', 'GarageFinish']

for gar in g_cat:
    df.boxplot(column = 'SalePrice_log', by = gar)
    plt.suptitle('')
    plt.show()

In [None]:
#### Numerical for Garage

df['YrSinceGarageBlt'] = df['YrSold'] - df['GarageYrBlt']

sns.pairplot(df[['SalePrice_log','YrSinceGarageBlt', 'GarageCars', 'GarageArea', 'GarageQual','GarageCond']])


##### Summary:
GarageQual and GarageCond are described very similarly in the data description. We observe a high correlation between them and can therefore remove one; we remove GarageCond.
GarageCars is a proxy for width: the more cars a garage holds, the wider it most likely is. GarageArea is the total square footage. The two tell much of the same story but are not identical, so we keep both for now.
The YrSinceGarageBlt feature seems to capture a negative relation: the older the garage, the lower the price. This variable should be tested against the age of the house, as the two may explain the same variation.


In [None]:
df.drop('GarageCond', axis = 1, inplace = True)
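
The suggested test of garage age against house age could look roughly like the sketch below. It assumes the house construction year is available in a `YearBuilt` column; the derived `HouseAge` column and the toy values are hypothetical and only illustrate the comparison.

```python
import pandas as pd

# Toy values (hypothetical), one row per sale
toy = pd.DataFrame({
    "YrSold":      [2008, 2009, 2010, 2007, 2006],
    "YearBuilt":   [1960, 2000, 1985, 1920, 2005],
    "GarageYrBlt": [1960, 2000, 1990, 1950, 2005],
})

toy["HouseAge"] = toy["YrSold"] - toy["YearBuilt"]
toy["YrSinceGarageBlt"] = toy["YrSold"] - toy["GarageYrBlt"]

# Correlation between the two ages, and share of garages built with the house
age_corr = toy["HouseAge"].corr(toy["YrSinceGarageBlt"])
same_year_share = (toy["YearBuilt"] == toy["GarageYrBlt"]).mean()

print(f"corr: {age_corr:.2f}, garages built with the house: {same_year_share:.0%}")
```

If the correlation on the real data is very high, keeping only one of the two age features would be in line with how GarageCond was handled above.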

In [None]:
# Number of sales in each neighborhood
# value_counts() returns a Series, so we name it with rename() rather than setting .columns
sales_nbh = df['Neighborhood'].value_counts().rename('Number of Sales')

sales_nbh.plot(kind = 'bar')
plt.title('Number of Sales in Ames\' Neighborhoods')
plt.ylabel('Number of Sales')
plt.xlabel('Neighborhood')
plt.show()


In [None]:
# AVG SALE PRICE BY NEIGHBORHOOD
# nbh_df (neighborhood coordinates) is assumed to be defined earlier in the notebook
nbhAvgPrice = (df.groupby('Neighborhood')['SalePrice'].mean()
                 .round(2)
                 .sort_values(ascending = False)
                 .rename('AvgSalePrice')
                 .reset_index())  # make Neighborhood a regular column for the merge and the map below
nbhAvgPrice = nbhAvgPrice.merge(nbh_df, on = 'Neighborhood')
nbhAvgPrice.head()
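
The sales count and the average price per neighborhood can also be computed in a single pass with named aggregation. This is a sketch on a toy frame with placeholder neighborhood labels and prices; the column names `NumSales` and `AvgSalePrice` and the variable `nbh_summary` are chosen here for illustration.

```python
import pandas as pd

# Toy data with placeholder neighborhoods (hypothetical values)
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B", "C"],
    "SalePrice": [100000, 120000, 200000, 220000, 240000, 150000],
})

nbh_summary = (
    toy.groupby("Neighborhood")["SalePrice"]
       .agg(NumSales="count", AvgSalePrice="mean")  # one row per neighborhood
       .sort_values("AvgSalePrice", ascending=False)
       .reset_index()
)

print(nbh_summary)
```

Keeping the count next to the mean helps to spot neighborhoods whose average rests on very few sales.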


In [None]:
# Visual plot
# Create a map of Ames using latitude and longitude values
# (approximate coordinates of Ames, IA, USA)
ames_latitude = 42.034534
ames_longitude = -93.620369

map_ames = folium.Map(location=[ames_latitude, ames_longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood, avgPrice in zip(nbhAvgPrice['Latitude'], nbhAvgPrice['Longitude'], nbhAvgPrice['Neighborhood'], nbhAvgPrice['AvgSalePrice']):
    label = '{}, Average Sales Price: {}'.format(neighborhood, avgPrice)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_ames)  # parse_html belongs to folium.Popup, not to CircleMarker

display(map_ames)

#df['Neighborhood'] = df['Neighborhood'].map(nbh)