# Ames Housing Price Prediction

## Step 1: Frame the Problem

- **Objective**: Given 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, we want to predict the final price of each home.
    
- From my limited understanding of real estate, the price of a house depends mainly on its location, proximity to entertainment and essential facilities, age, and number of bedrooms. The dataset contains more information than I expected.
- The dataset has 79 variables describing (almost) every aspect of residential homes in Ames, Iowa. The SalePrice field is the target variable, and prices are in USD.
- Many columns contain NA values, but in the columns listed below NA has a specific meaning, so I will keep those NA values in the dataset.
The following list gives the meaning of each NA value and the fields in which it appears:

- NA: No alley access
    - Alley: Type of alley access to property


- NA: No Basement
    - BsmtQual: Evaluates the height of the basement
    - BsmtCond: Evaluates the general condition of the basement
    - BsmtExposure: Refers to walkout or garden level walls
    - BsmtFinType1: Rating of basement finished area
    - BsmtFinType2: Rating of basement finished area (if multiple types)


- NA: No Garage
    - GarageType: Garage location
    - GarageFinish: Interior finish of the garage
    - GarageCond: Garage condition
    - GarageQual: Garage quality


- NA: No Pool
    - PoolQC: Pool quality


- NA: No Fence
    - Fence: Fence quality


- NA: None   
    - MiscFeature: Miscellaneous feature not covered in other categories
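One practical consequence of the list above: by default `pd.read_csv` parses the literal string `NA` as a missing value (NaN), so the "no alley", "no pool" meaning is lost on load. The sketch below uses a tiny made-up CSV (the rows and the `meaningful_na_cols` name are assumptions for illustration, not the real data) to show the effect and one way to restore an explicit label.

```python
import io
import pandas as pd

# Tiny hypothetical CSV that mimics three of the real columns
csv_text = "Id,Alley,PoolQC,LotFrontage\n1,NA,NA,65\n2,Grvl,NA,\n3,NA,Gd,80\n"
df = pd.read_csv(io.StringIO(csv_text))

# pandas treats the string "NA" as missing by default
n_missing_before = int(df["Alley"].isna().sum())

# Restore the explicit "NA" label only where NA carries a meaning
meaningful_na_cols = ["Alley", "PoolQC"]
df[meaningful_na_cols] = df[meaningful_na_cols].fillna("NA")

print(n_missing_before)
print(df)
```

LotFrontage is left untouched here because its missing value is a genuine gap, not a category.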

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
print(data.shape)
data.head()

## Step 2: Data Exploration

In [None]:
data_explore = data.copy()
data_explore = data_explore.drop(columns="Id")

- The attributes cluster around a few areas/segments. For example, the Garage segment has several related attributes such as GarageCond, GarageArea, and GarageCars.
- I will explore the following important segments in the dataset:
    - Plot
    - Zone & Neighbourhood
    - Type of house
    - Proximity to various conditions
    - Garage

In [None]:
data_explore.info()

In [None]:
nulls = data_explore.isna().sum()
nulls[nulls>0]

We know that in some columns NA carries a specific meaning. Let's replace the NaN values in those columns with the string NA.

In [None]:
na_cols = ["Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "GarageType", "GarageFinish", "GarageCond", "GarageQual"]

data_explore[na_cols] = data_explore[na_cols].fillna("NA")

In [None]:
data_explore["Alley"].value_counts()

In [None]:
nulls = data_explore.isna().sum()
nan_cols = nulls[nulls>0].index
data_explore[nan_cols].info()

In [None]:
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

In [None]:
num_nans = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
cat_nans = ['MasVnrType', 'Electrical', 'FireplaceQu']
data_explore[num_nans] = num_imputer.fit_transform(data[num_nans])
data_explore[cat_nans] = cat_imputer.fit_transform(data[cat_nans])

In [None]:
nulls = data_explore.isna().sum()
nan_cols = nulls[nulls>0].index
nan_cols

In [None]:
data_explore.head()

In [None]:
data_explore['MSSubClass'] = data_explore['MSSubClass'].astype(str)

cat_attrs = []
num_attrs = []
columns = list(data_explore.columns)
for col in columns:
    if data_explore[col].dtype=='O':
        cat_attrs.append(col)
    else:
        num_attrs.append(col)

### Statistical Overview

In [None]:
data_explore.describe()

- With 79 variables, box plots are not the best way to visualize the distributions and analyse outliers. Let's look directly at the number of outliers in each attribute.

In [None]:
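# Restrict to numeric columns: quantiles and comparisons are undefined for strings
num_data = data_explore[num_attrs]
Q1 = num_data.quantile(0.25)
Q3 = num_data.quantile(0.75)
IQR = Q3 - Q1
outliers = ((num_data < (Q1 - 1.5 * IQR)) | (num_data > (Q3 + 1.5 * IQR))).sum()
outliers[outliers>0]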

- No column has an overwhelming number of outliers. EnclosedPorch, BsmtFinSF2, LotFrontage, and ScreenPorch are the only attributes with more than 100 outliers.
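To make the 1.5 × IQR rule concrete, here is a minimal worked example on a made-up series (the values are illustrative only). A point counts as an outlier when it falls below Q1 − 1.5·IQR or above Q3 + 1.5·IQR.

```python
import pandas as pd

# Hypothetical values: nine ordinary points and one extreme one
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of points outside the fences
is_outlier = (s < lower) | (s > upper)
print(lower, upper, s[is_outlier].tolist())
```

Note that for columns where most houses have a value of zero (porches, secondary basement area), the IQR can collapse to zero, so every non-zero value is flagged. A high outlier count in such columns does not necessarily mean the data is bad.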

### Distribution of Sale Price

In [None]:
data_explore["SalePrice"].hist()

- Clearly, the sale price is right-skewed: most values sit on the left, with a heavy right tail.
- This distribution also indicates that there are very few records of expensive houses. This is a potential drawback: with so little data on expensive houses, the model may fail to predict their prices accurately.
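The effect of a log transform on a right-skewed variable can be checked numerically. The sketch below draws synthetic log-normal "prices" (an assumption standing in for SalePrice, not the real data) and compares the sample skewness before and after taking logs.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices; the parameters are arbitrary
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

raw_skew = float(prices.skew())          # clearly positive: long right tail
log_skew = float(np.log(prices).skew())  # close to zero after the log

print(round(raw_skew, 2), round(log_skew, 2))
```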

In [None]:
plt.hist(data_explore["SalePrice"].apply(np.log))
plt.show()

### Correlation Plot

In [None]:
plt.figure(figsize=(85, 16))
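# Correlation is only defined for numeric columns
corr_matrix = data_explore[num_attrs].corr()
# np.bool was removed from NumPy; the builtin bool is the replacement
sns.heatmap(corr_matrix, mask=np.zeros_like(corr_matrix, dtype=bool), square=True, annot=True, cbar=False)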
plt.tight_layout()

In [None]:
corr_matrix['SalePrice'].sort_values(ascending=False)

- Many variables are correlated with sale price.
- There is also strong correlation among many of the independent variables.
- Looking at the attributes, sale price is highly correlated with segments such as Garage, the various area measurements, the number of rooms, etc.
- These results don't include categorical variables, so a few of those may also be correlated with sale price.
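Since the correlation matrix skips categorical columns, one quick way to gauge their relationship with the target is to compare the median sale price per category. A minimal sketch on made-up rows (the prices below are invented for illustration):

```python
import pandas as pd

# Hypothetical sample; KitchenQual codes follow the dataset's quality scale
df = pd.DataFrame({
    "KitchenQual": ["Ex", "Gd", "TA", "Gd", "TA", "Ex", "Fa", "TA"],
    "SalePrice": [350000, 220000, 150000, 200000, 140000, 310000, 100000, 160000],
})

# Median price per category, most expensive first
medians = df.groupby("KitchenQual")["SalePrice"].median().sort_values(ascending=False)
print(medians)
```

A wide spread between category medians suggests the variable carries signal, which the box plots later in this step show visually.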

In [None]:
features_to_viz = ['GrLivArea', 'GarageArea', 'TotalBsmtSF']
i=1
sns.set_theme()  # the "seaborn" matplotlib style name is deprecated since matplotlib 3.6
plt.figure(figsize=(15, 6))
for feature in features_to_viz:
    plt.subplot(1, 3, i)
    i=i+1
    plt.scatter(data_explore[feature], data_explore['SalePrice'])
    plt.title("Sale Price Vs "+feature)

- The vertical lines at zero in some of the graphs indicate that the attribute is absent for those houses, e.g. houses with no garage or no basement.

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=data_explore)

In [None]:
plt.figure(figsize=(18, 8))
sns.boxplot(x='YearBuilt', y='SalePrice', data=data_explore)
plt.xticks(rotation=90);

- Not without exceptions, but it's fair to say that newer houses are priced higher than older ones.
- Old houses can still be expensive because of factors such as location, aesthetics, and build quality.
- Decade after decade, the price of the cheapest houses seems to be increasing.

In [None]:
plt.scatter(data_explore['GrLivArea'], data_explore['SalePrice'], c=data_explore['TotRmsAbvGrd'], cmap="Set2_r")
plt.title('SalePrice Vs. GrLivArea')
plt.colorbar().set_label('# of Total Rooms Above Ground', fontsize=14)

In [None]:
data_explore['GarageCars'].value_counts()

In [None]:
plt.scatter(data_explore['GrLivArea'], data_explore['SalePrice'], c=data_explore['GarageCars'], cmap="Set2_r")
plt.title('SalePrice Vs. GrLivArea')
plt.colorbar().set_label('Capacity of # Cars in Garage', fontsize=14)

- The sale price increases with the garage's car capacity. Since room for more cars means more garage area, we can infer that houses with a large garage area tend to be expensive.

In [None]:
plt.scatter(data_explore['GrLivArea'], data_explore['SalePrice'], c=data_explore['YearBuilt'].astype('int'), cmap="rainbow")
plt.title('SalePrice Vs. GrLivArea')
plt.colorbar().set_label('YearBuilt', fontsize=14)

- The newer the house, the higher the price, and vice versa.

In [None]:
features_to_viz = ['ExterQual', 'GarageQual', 'KitchenQual', 'FireplaceQu', 'BsmtQual', 'BsmtExposure',]
i=1
plt.figure(figsize=(15, 10))
for col in features_to_viz:
    plt.subplot(3, 2, i)
    sns.boxplot(y=col, x='SalePrice', data=data_explore, orient='h')
    i+=1

- Quality/Condition
    - Ex:	Excellent
    - Gd:	Good
    - TA:	Average/Typical
    - Fa:	Fair
    - Po:	Poor
    - NA: Not available (e.g. no garage or no basement); for BsmtExposure, No means no exposure

- The better the quality, the higher the price.
- For garage quality, there is not much difference in house prices.
- Houses with no basement (NA) are cheaper.

In [None]:
features_to_viz = ['BldgType', 'HouseStyle', 'Foundation', 'MSZoning',]
i=1
plt.figure(figsize=(15, 10))
for col in features_to_viz:
    plt.subplot(3, 2, i)
    sns.boxplot(y=col, x='SalePrice', data=data_explore, orient='h')
    i+=1

- BldgType: Type of dwelling	
   - 1Fam:	Single-family Detached	
   - 2FmCon:	Two-family Conversion; originally built as one-family dwelling
   - Duplx:	Duplex
   - TwnhsE:	Townhouse End Unit
   - TwnhsI:	Townhouse Inside Unit
   
   
- HouseStyle: Style of dwelling
   - 1Story:	One story
   - 1.5Fin:	One and one-half story: 2nd level finished
   - 1.5Unf:	One and one-half story: 2nd level unfinished
   - 2Story:	Two story
   - 2.5Fin:	Two and one-half story: 2nd level finished
   - 2.5Unf:	Two and one-half story: 2nd level unfinished
   - SFoyer:	Split Foyer
   - SLvl: 	    Split Level
   
   
- Foundation: Type of foundation
   - BrkTil:	Brick & Tile
   - CBlock:	Cinder Block
   - PConc:	Poured Concrete	
   - Slab:	Slab
   - Stone:	Stone
   - Wood:	Wood


- MSZoning: Identifies the general zoning classification of the sale.		
   - A:	Agriculture
   - C:	Commercial
   - FV:	Floating Village Residential
   - I:	Industrial
   - RH:	Residential High Density
   - RL:	Residential Low Density
   - RP:	Residential Low Density Park 
   - RM:	Residential Medium Density


- Single-family Detached and Townhouse End Unit buildings are the most expensive.
- Two-story houses and two-and-one-half-story houses with a finished 2nd level have a median price around 200,000 USD.
- Houses in the Floating Village Residential zone have a median sale price above 200,000 USD; the next most expensive zone is Residential High Density.

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.boxplot(y='SaleType', x='SalePrice', data=data_explore)
plt.subplot(1, 2, 2)
sns.boxplot(y='SaleCondition', x='SalePrice', data=data_explore)
plt.show()

- SaleType: Type of sale		
   - WD: 	Warranty Deed - Conventional
   - CWD:	Warranty Deed - Cash
   - VWD:	Warranty Deed - VA Loan
   - New:	Home just constructed and sold
   - COD:	Court Officer Deed/Estate
   - Con:	Contract 15% Down payment regular terms
   - ConLw:	Contract Low Down payment and low interest
   - ConLI:	Contract Low Interest
   - ConLD:	Contract Low Down
   - Oth:	Other
	
    
- SaleCondition: Condition of sale
   - Normal:	Normal Sale
   - Abnorml:	Abnormal Sale -  trade, foreclosure, short sale
   - AdjLand:	Adjoining Land Purchase
   - Alloca:	Allocation - two linked properties with separate deeds, typically condo with a garage unit	
   - Family:	Sale between family members
   - Partial:	Home was not completed when last assessed (associated with New Homes)

- From the left-hand chart, we can see that sale prices are high for newly constructed homes.
- Now let's look at the distributions of the attributes most highly correlated with sale price.

In [None]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
print(cols)
sns.pairplot(data_explore[cols])
plt.show()

- In the first row, we can see an upward trend in sale price for each attribute.
- High-quality houses tend to have a large living area. This is not true in every case, but the trend is there.
- In recent years, many houses have been built with large basements.

## Step 3: Data Preprocessing

- Data modifications performed in the previous step:
    - Kept NA as NA for some columns.
    - Replaced some NaN values with the mean/most frequent value in the column.
    - Converted some numerical or ordinal attributes into categorical attributes.
    - Dropped the Id column.


- We also observed that the distributions of some variables are right-skewed. I will apply a power transformation (Yeo-Johnson, a Box-Cox-like transform that also accepts zero and negative values) to handle such distributions.
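As a small illustration of what the power transformation does, the sketch below applies scikit-learn's `PowerTransformer` to a synthetic right-skewed feature (the `area` array is made up, not a column from the dataset) and compares skewness before and after.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature; shape (n_samples, 1) as scikit-learn expects
rng = np.random.default_rng(42)
area = rng.lognormal(mean=5.0, sigma=0.5, size=2000).reshape(-1, 1)

# Yeo-Johnson estimates a power parameter per column, then standardizes
pt = PowerTransformer(method="yeo-johnson", standardize=True)
area_t = pt.fit_transform(area)

skew_before = float(pd.Series(area[:, 0]).skew())
skew_after = float(pd.Series(area_t[:, 0]).skew())
print(round(skew_before, 2), round(skew_after, 2), pt.lambdas_)
```

Because the power parameter is learned in `fit`, the transformer should be fitted on the training split only and then applied to the test split, which is what placing it inside a pipeline achieves.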

In [None]:
X = data.drop(columns=['SalePrice'])
y = data['SalePrice'].copy()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
y_log_train = np.log(y_train)
y_log_test = np.log(y_test)

In [None]:
na_cols = ["Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "GarageType", "GarageFinish", "GarageCond", "GarageQual", "PoolQC", "Fence", "MiscFeature"]
cat_attrs = [cat for cat in cat_attrs if cat not in na_cols]
num_attrs.remove('SalePrice')

In [None]:
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, OneHotEncoder

In [None]:
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="mean")),
                        ('transformer', PowerTransformer(method='yeo-johnson', standardize=True))])

cat_pipeline_1 = Pipeline([('cat_na_fill', SimpleImputer(strategy="constant", fill_value='NA')),
                          ('encoder', OneHotEncoder(handle_unknown='ignore'))])

cat_pipeline_2 = Pipeline([('cat_nan_fill', SimpleImputer(strategy="most_frequent")),
                          ('encoder', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
pre_process = ColumnTransformer([('drop_id', 'drop', ['Id']),
                                ('cat_pipeline_1', cat_pipeline_1, na_cols),
                                ('cat_pipeline_2', cat_pipeline_2, cat_attrs),
                                ('num_pipeline', num_pipeline, num_attrs)], remainder='passthrough')

In [None]:
X_train_transformed = pre_process.fit_transform(X_train)
X_test_transformed = pre_process.transform(X_test)

In [None]:
X_train_transformed.shape, X_test_transformed.shape

In [None]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
oh_na_cols = list(pre_process.transformers_[1][1]['encoder'].get_feature_names_out(na_cols))
oh_nan_cols = list(pre_process.transformers_[2][1]['encoder'].get_feature_names_out(cat_attrs))
feature_columns = oh_na_cols+oh_nan_cols + num_attrs

## Step 4: Modelling

In [None]:
from sklearn.model_selection import GridSearchCV, KFold

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

In [None]:
from sklearn.linear_model import ElasticNet

In [None]:
elastic_net_grid_param = [{'l1_ratio': list(np.linspace(0, 1, 10)), 'alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1]}]
elastic_net_grid_search = GridSearchCV(ElasticNet(random_state=42), elastic_net_grid_param, cv=kf, scoring='neg_root_mean_squared_error', return_train_score=True, n_jobs=-1)
elastic_net_grid_search.fit(X_train_transformed, y_log_train)

In [None]:
train_results=[]
train_results.append(['Elastic Net', elastic_net_grid_search.best_params_, -elastic_net_grid_search.best_score_])
elastic_net_grid_search.best_params_, -elastic_net_grid_search.best_score_
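A note on the sign flip above: scikit-learn scorers follow a "higher is better" convention, so `neg_root_mean_squared_error` returns the RMSE multiplied by −1, and because the target is the log of the price, the negated score is an RMSE in log space. The sketch below reproduces the pattern on synthetic data (the `make_regression` data and the small grid are assumptions, not the notebook's settings).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the transformed training data
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

demo_search = GridSearchCV(
    ElasticNet(random_state=42),
    {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.8]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the negated RMSE, so it is <= 0; flip the sign to report an error
cv_rmse = -demo_search.best_score_
print(demo_search.best_params_, round(cv_rmse, 3))
```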

In [None]:
best_elastic_net_reg = elastic_net_grid_search.best_estimator_
best_elastic_net_reg

In [None]:
feature_imp = [ col for col in zip(feature_columns, best_elastic_net_reg.coef_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp[:15]

In [None]:
from sklearn.svm import SVR

In [None]:
svr_grid_param = [{'C':list(np.linspace(0.1, 1, 10)), 'epsilon':[0.01, 0.05, 0.1, 0.5, 1]}]
svr_grid_search = GridSearchCV(SVR(kernel="poly", degree=2), svr_grid_param, cv=kf, scoring="neg_root_mean_squared_error", return_train_score=True, n_jobs=-1)
svr_grid_search.fit(X_train_transformed, y_log_train)

In [None]:
train_results.append(['SVR', svr_grid_search.best_params_, -svr_grid_search.best_score_])
svr_grid_search.best_params_, -svr_grid_search.best_score_

In [None]:
best_svr_reg = svr_grid_search.best_estimator_
best_svr_reg

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# 1.0 means "use all features", which is what 'auto' meant for regressors before it was removed
rf_grid_param = [{'max_features':[0.2, 0.4, 0.6, 1.0], 'max_depth':[8, 12, 16, 20]}]
rf_grid_search = GridSearchCV(RandomForestRegressor(n_estimators=300, random_state=42, n_jobs=-1), rf_grid_param, cv=kf, scoring='neg_root_mean_squared_error', return_train_score=True, n_jobs=-1)
rf_grid_search.fit( X_train_transformed, y_log_train)

In [None]:
train_results.append(['Random Forest', rf_grid_search.best_params_, -rf_grid_search.best_score_])
rf_grid_search.best_params_, -rf_grid_search.best_score_

In [None]:
best_rf_reg = rf_grid_search.best_estimator_
best_rf_reg

In [None]:
feature_imp = [ col for col in zip(feature_columns, best_rf_reg.feature_importances_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp[:15]

In [None]:
from xgboost import XGBRegressor

In [None]:
xgb_grid_parm=[{'max_depth':[4, 6, 8, 12], 'subsample':[0.5, 0.75, 1.0]}]
xgb_grid_search = GridSearchCV(XGBRegressor(objective='reg:squarederror', n_estimators=300, learning_rate=0.1, random_state=42, n_jobs=-1), xgb_grid_parm, cv=kf, scoring="neg_root_mean_squared_error", return_train_score=True, n_jobs=-1)
xgb_grid_search.fit(X_train_transformed, y_log_train)

In [None]:
train_results.append(['XGBoost', xgb_grid_search.best_params_, -xgb_grid_search.best_score_])
xgb_grid_search.best_params_, -xgb_grid_search.best_score_

In [None]:
cvres = xgb_grid_search.cv_results_
for train_mean_score, test_mean_score, params in zip(cvres["mean_train_score"], cvres["mean_test_score"], cvres["params"]):
    print(-train_mean_score, -test_mean_score, params)

In [None]:
best_xgb_reg = xgb_grid_search.best_estimator_
best_xgb_reg

In [None]:
feature_imp = [ col for col in zip(feature_columns, best_xgb_reg.feature_importances_)]
feature_imp.sort(key=lambda x:x[1], reverse=True)
feature_imp[:15]

In [None]:
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import ElasticNetCV

In [None]:
base_estimators = [('elastic_net', best_elastic_net_reg), ('svr', best_svr_reg), ('rf', best_rf_reg), ('xgb', best_xgb_reg)]

stack_reg = StackingRegressor(estimators=base_estimators, final_estimator=ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, .99, 1], random_state=42), cv=kf, passthrough=False, n_jobs=-1)
stack_reg.fit(X_train_transformed, y_log_train)

In [None]:
from sklearn.model_selection import cross_val_score

stack_rmse_scores = cross_val_score(stack_reg, X_train_transformed, y_log_train, scoring='neg_root_mean_squared_error', cv=kf, n_jobs=-1)
stack_rmse = np.round(np.mean(-stack_rmse_scores), 4)
train_results.append(['Stacking', '', stack_rmse])

## Step 5: Model Evaluation

In [None]:
pd.set_option('display.max_colwidth', None)

train_models_df = pd.DataFrame(train_results, columns=['Model', 'Best Params', 'RMSLE'])
train_models_df

In [None]:
best_models = [best_elastic_net_reg, best_svr_reg, best_rf_reg, best_xgb_reg, stack_reg]
model_names = []
model_rmse = []

# Note: cross_val_score clones and refits each model on folds of the test split,
# so these scores describe the tuned configurations retrained on test data,
# not the models already fitted on the training split.
for model in best_models:
    test_rmse_scores = cross_val_score(model, X_test_transformed, y_log_test, scoring='neg_root_mean_squared_error', cv=kf, n_jobs=-1)
    test_rmse_scores = np.round(-test_rmse_scores,4)
    test_rmse = np.round(np.mean(test_rmse_scores),4)
    model_names.append(model.__class__.__name__)
    model_rmse.append(test_rmse)

In [None]:
def plot_results(model_names, model_rmse):
        
    plt.figure(figsize=(12, 5))
    x_indexes = np.arange(len(model_names))     
    width = 0.15                            
    
    plt.barh(x_indexes, model_rmse)
    for i in range(len(x_indexes)):
        plt.text(x=model_rmse[i], y=x_indexes[i], s=str(model_rmse[i]), fontsize=12)
    
    plt.xlabel("Mean RMSLE", fontsize=14)
    plt.yticks(ticks=x_indexes, labels=model_names, fontsize=14)
    plt.title("Results on Test Dataset")
    plt.show()

In [None]:
plot_results(model_names, model_rmse)
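The scores plotted above come from `cross_val_score` on the test split, which refits each model on test folds. A complementary check is a plain hold-out evaluation: fit once on the training split, predict the untouched test split, and compute the RMSE directly. The sketch below shows that pattern on synthetic data (the data and the single ElasticNet model are assumptions, not the notebook's models).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the transformed features and log prices
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit on the training split only
model = ElasticNet(alpha=0.01, random_state=42).fit(X_tr, y_tr)

# Score the already-fitted model on rows it has never seen
holdout_rmse = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
print(round(holdout_rmse, 3))
```

In the notebook's terms this would mean calling `predict` on `X_test_transformed` with each fitted model and comparing against `y_log_test`.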

- Now let's see how the model performs on the overall dataset. This will help us understand where it underperforms.

In [None]:
best_model = best_models[np.argmin(model_rmse)]
best_model

In [None]:
y_train_pred = best_model.predict(X_train_transformed)
y_test_pred = best_model.predict(X_test_transformed)
y_train_pred = np.exp(y_train_pred)
y_test_pred = np.exp(y_test_pred)
predicted = np.concatenate([y_train_pred, y_test_pred], axis=0)
obsereved = np.concatenate([y_train, y_test], axis=0)

In [None]:
combine_data = pd.concat([X_train, X_test], axis=0)
combine_data['SalePrice'] = obsereved
combine_data['Predicted_SalePrice'] = predicted
combine_data.shape

In [None]:
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
ax = combine_data['SalePrice'].hist()
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.subplot(1, 2, 2)
ax = combine_data['Predicted_SalePrice'].hist()
for p in ax.patches:
        ax.annotate('{}'.format(p.get_height()), (p.get_x()+0.1, p.get_height()+10))
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
plt.scatter(combine_data['GrLivArea'], combine_data['SalePrice'], label="Observed")
plt.scatter(combine_data['GrLivArea'], combine_data['Predicted_SalePrice'] , c='green', label="Predicted")
plt.xlabel('GrLivArea')
plt.ylabel('Sale Price')
plt.legend()
plt.show()

- The model does well when predicting sale prices for houses with a living area below 2000 square feet.
- For houses with a living area above 2000 square feet, there is a considerable difference between the observed and predicted values.
- At the bottom left, there are some houses with low sale prices for which the model has predicted somewhat higher prices.
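The visual impression from the scatter plot can be backed by a number, for example the mean absolute percentage error on each side of the 2000 square feet mark. The sketch below uses a handful of invented rows with the same column names as `combine_data`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical observed vs. predicted prices
df = pd.DataFrame({
    "GrLivArea": [900, 1500, 1800, 2200, 2600, 3200],
    "SalePrice": [100000, 150000, 180000, 260000, 320000, 450000],
    "Predicted_SalePrice": [105000, 147000, 183600, 234000, 352000, 360000],
})

# Absolute error relative to the observed price
df["abs_pct_error"] = (df["Predicted_SalePrice"] - df["SalePrice"]).abs() / df["SalePrice"]

# Split at the 2000 sq ft threshold discussed above
df["area_bin"] = np.where(df["GrLivArea"] < 2000, "below_2000", "2000_and_above")
summary = df.groupby("area_bin")["abs_pct_error"].mean()
print(summary)
```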

## Step 6: Make Submission

In [None]:
final_model = Pipeline([('pre_process', pre_process),
                       ('best_model', best_model)])
final_model.fit(X_train, y_log_train)

In [None]:
test_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

In [None]:
log_predictions = final_model.predict(test_data)
predictions = np.exp(log_predictions)
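The manual `np.log` / `np.exp` round trip used here can also be delegated to scikit-learn's `TransformedTargetRegressor`, which applies the forward function before fitting and the inverse after predicting. This is an alternative design, not what the notebook does; the sketch below checks on synthetic data (made-up `area` and `price` arrays) that both routes give the same predictions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: price grows exponentially with area, plus noise
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=100).reshape(-1, 1)
price = np.exp(11.0 + 0.0005 * area[:, 0] + rng.normal(0, 0.1, size=100))

# Manual route: fit on log(price), exponentiate the predictions
manual = LinearRegression().fit(area, np.log(price))
manual_pred = np.exp(manual.predict(area))

# Wrapped route: the log/exp pair travels with the model
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
wrapped.fit(area, price)
wrapped_pred = wrapped.predict(area)

print(np.allclose(manual_pred, wrapped_pred))
```

Either way, predictions come back in USD and are always positive, since they are the exponential of a real number.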

In [None]:
test_predictions = pd.DataFrame(test_data['Id'])
test_predictions['SalePrice'] = predictions.copy()

In [None]:
test_predictions.head()

In [None]:
test_predictions.to_csv("./submission.csv", index=False)